JBoss AS 7 Infinispan Cluster - Spring

I have a two-node JBoss AS 7.1.1.FINAL cluster set up in the following way -
master - running on Ubuntu Server 12.10 (VirtualBox VM)
slave - running on Windows 7 (VirtualBox host machine)
I have deployed a Spring web application on both nodes and I'm trying to set up a working replicated cache. My problem is that the cache does not seem to be replicated even though the clustering apparently works.
My config -
in domain.xml (both on master and slave)
<subsystem xmlns="urn:jboss:domain:infinispan:1.2" default-cache-container="cluster">
<cache-container name="cluster" aliases="ha-partition" default-cache="default" jndi-name="java:jboss/infinispan/cluster" start="EAGER">
<transport lock-timeout="60000" />
<replicated-cache name="default" mode="SYNC" batching="true">
<locking isolation="REPEATABLE_READ"/>
</replicated-cache>
</cache-container>
</subsystem>
This is pretty much the default config in domain.xml, except for the jndi-name and the EAGER start.
In the Spring configuration -
<infinispan:container-cache-manager id="cacheManager" cache-container-ref="springCacheContainer" />
<jee:jndi-lookup id="springCacheContainer" jndi-name="java:jboss/infinispan/cluster" />
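For context, the cache manager is consumed through Spring's cache abstraction; that part isn't shown in the question, but a minimal sketch (assuming the Spring cache namespace is declared and annotation-driven caching is used) would be -
<cache:annotation-driven cache-manager="cacheManager" />
<!-- methods annotated with @Cacheable("default") then use the replicated "default" cache defined in domain.xml -->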
With this setup, the caching works, but it's not replicated. The caches seem to operate independently of each other. Also, the EAGER start seems to have no effect: the caches appear to be initialized only when they are first used.
From the master log (first time the cache is used) -
[Server:server-one] 03:25:55,756 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp-192.168.2.13-192.168.2.13-8009-3) ISPN000078: Starting JGroups Channel
[Server:server-one] 03:25:55,762 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp-192.168.2.13-192.168.2.13-8009-3) ISPN000094: Received new cluster view: [master:server-one/cluster|1] [master:server-one/cluster, slave:server-one-slave/cluster]
[Server:server-one] 03:25:55,763 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp-192.168.2.13-192.168.2.13-8009-3) ISPN000079: Cache local address is master:server-one/cluster, physical addresses are [192.168.2.13:55200]
[Server:server-one] 03:25:55,769 INFO [org.infinispan.factories.GlobalComponentRegistry] (ajp-192.168.2.13-192.168.2.13-8009-3) ISPN000128: Infinispan version: Infinispan 'Brahma' 5.1.2.FINAL
[Server:server-one] 03:25:55,851 INFO [org.jboss.as.clustering.infinispan] (ajp-192.168.2.13-192.168.2.13-8009-3) JBAS010281: Started cluster cache from cluster container
From the slave log (first time the cache is used) -
[Server:server-one-slave] 03:29:38,124 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp--192.168.2.10-8009-2) ISPN000078: Starting JGroups Channel
[Server:server-one-slave] 03:29:38,129 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp--192.168.2.10-8009-2) ISPN000094: Received new cluster view: [master:server-one/cluster|1] [master:server-one/cluster, slave:server-one-slave/cluster]
[Server:server-one-slave] 03:29:38,130 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ajp--192.168.2.10-8009-2) ISPN000079: Cache local address is slave:server-one-slave/cluster, physical addresses are [192.168.2.10:55200]
[Server:server-one-slave] 03:29:38,133 INFO [org.infinispan.factories.GlobalComponentRegistry] (ajp--192.168.2.10-8009-2) ISPN000128: Infinispan version: Infinispan 'Brahma' 5.1.2.FINAL
[Server:server-one-slave] 03:29:38,195 INFO [org.jboss.as.clustering.infinispan] (ajp--192.168.2.10-8009-2) JBAS010281: Started cluster cache from cluster container
I don't think this is a UDP/multicast issue, as I have mod_cluster, HornetQ and Quartz set up in this cluster and they all work as expected.

Putting <distributable/> in web.xml did the trick.
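For reference, a minimal web.xml sketch with the marker element (servlet 3.0 namespace assumed) -
<web-app xmlns="http://java.sun.com/xml/ns/javaee" version="3.0">
<distributable/>
</web-app>
The element is empty; it just marks the web application as distributable, which is what tells the container to treat the deployment as clustered.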

I had a similar issue where my cache wouldn't replicate until the application was first used. I was able to resolve this by setting the "start" attribute of the replicated-cache to EAGER, along with the cache-container attribute start="EAGER".
<replicated-cache name="default" mode="SYNC" batching="true" start="EAGER">
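Putting both together, the container definition from the question would then read -
<cache-container name="cluster" aliases="ha-partition" default-cache="default" jndi-name="java:jboss/infinispan/cluster" start="EAGER">
<transport lock-timeout="60000" />
<replicated-cache name="default" mode="SYNC" batching="true" start="EAGER">
<locking isolation="REPEATABLE_READ"/>
</replicated-cache>
</cache-container>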

Related

NiFi Cluster setup

Please help me complete my NiFi cluster setup. I can see NiFi is running on the server, but the GUI is not coming up.
Java version:
openjdk version "1.8.0_302"
OpenJDK Runtime Environment (build 1.8.0_302-b08)
OpenJDK 64-Bit Server VM (build 25.302-b08, mixed mode)
NiFi version: nifi-1.17.0
nifi.properties:
nifi.state.management.embedded.zookeeper.start=true
nifi.remote.input.host=Svxxx.xyz.com
nifi.remote.input.secure=true
nifi.remote.input.socket.port=10443
nifi.remote.input.http.enabled=true
nifi.web.https.host=Svxxx.xyz.com
nifi.web.https.port=9443
nifi.web.proxy.host=localhost:9443,Svxxx.xyz.com:9443
nifi.sensitive.props.key=propkeywith12chars
nifi.cluster.is.node=true
nifi.cluster.node.address=Svxxx.xyz.com
nifi.cluster.node.protocol.port=11443
nifi.cluster.load.balance.host=Svxxx.xyz.com
nifi.cluster.load.balance.port=6342
nifi.zookeeper.connect.string=Svxxx.xyz.com:2181,Svxxx.xyz.com:2181,Svxxx.xyz.com:2181
zookeeper.properties:
server.1=Svxxx.xyz.com:2888:3888;2181
server.2=Svxxx.xyz.com:2888:3888;2181
server.3=Svxxx.xyz.com:2888:3888;2181
Changes made in state-management.xml:
<cluster-provider>
<id>zk-provider</id>
<class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
<property name="Connect String">Svxxx.xyz.com:2181,Svxxx.xyz.com:2181,Svxxx.xyz.com:2181</property>
<property name="Root Node">/nifi</property>
<property name="Session Timeout">10 seconds</property>
<property name="Access Control">Open</property>
</cluster-provider>
Firewall status: disabled
I also created SSL certificates using the toolkit and put them on the respective servers, replacing truststore.jks and keystore.jks accordingly.
nifi-app.log:
2022-10-17 22:00:05,669 WARN [main] o.a.nifi.controller.StandardFlowService There is currently no Cluster Coordinator. This often happens upon restart of NiFi when running an embedded ZooKeeper. Will register this node to become the active Cluster Coordinator and will attempt to connect to cluster again
2022-10-17 22:00:05,670 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=false] Attempted to register Leader Election for role 'Cluster Coordinator' but this role is already registered
2022-10-17 22:00:11,403 WARN [Heartbeat Monitor Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager Unable to determine leader for role 'Cluster Coordinator'; returning null
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2480)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:242)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:231)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:228)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:219)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:41)
at org.apache.curator.framework.recipes.locks.LockInternals.getSortedChildren(LockInternals.java:154)
at org.apache.curator.framework.recipes.locks.LockInternals.getParticipantNodes(LockInternals.java:134)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.getParticipantNodes(InterProcessMutex.java:170)
at org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:337)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.getLeader(CuratorLeaderElectionManager.java:281)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.verifyLeader(CuratorLeaderElectionManager.java:571)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener.isLeader(CuratorLeaderElectionManager.java:525)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$LeaderRole.isLeader(CuratorLeaderElectionManager.java:466)
at org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager.isLeader(CuratorLeaderElectionManager.java:262)
at org.apache.nifi.cluster.coordination.node.NodeClusterCoordinator.isActiveClusterCoordinator(NodeClusterCoordinator.java:824)
at org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor.monitorHeartbeats(AbstractHeartbeatMonitor.java:132)
at org.apache.nifi.cluster.coordination.heartbeat.AbstractHeartbeatMonitor$1.run(AbstractHeartbeatMonitor.java:84)
at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2022-10-17 22:00:12,371 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) from archive
2022-10-17 22:00:12,371 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Archive cleanup completed for container default; will now allow writing to this container. Bytes used = 10.53 GB, bytes free = 26.46 GB, capacity = 36.99 GB
2022-10-17 22:00:14,203 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#22e813fc checkpointed with 1 Records and 0 Swap Files in 5 milliseconds (Stop-the-world time = 2 milliseconds, Clear Edit Logs time = 2 millis), max Transaction ID 3
2022-10-17 22:00:18,823 WARN [main] o.a.n.c.l.e.CuratorLeaderElectionManager Unable to determine leader for role 'Cluster Coordinator'; returning null
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /nifi/leaders/Cluster Coordinator

High CPU usage on idle AMQ Artemis cluster, related to locks with shared-store HA

I have an AMQ Artemis cluster, shared-store HA (master-slave), version 2.17.0.
I noticed that all of my clusters (active servers only) that are idle (no one is using them) are using from 10% to 20% CPU, except one, which is using around 1% (totally normal). I started investigating...
Long story short - only one cluster has completely normal CPU usage. The only difference I've managed to find is that if I connect to that normal cluster's master node and attempt telnet slave 61616, it shows as connected. If I do the same in any other cluster (one with high CPU usage), it shows as rejected.
In order to better understand what is happening, I enabled DEBUG logs in instance/etc/logging.properties. Here is what the master node is spamming:
2021-05-07 13:54:31,857 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Backup is not active, trying original connection configuration now.
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Trying reconnection attempt 0/1
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Trying to connect with connectorFactory = org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnectorFactory@6cf71172, connectorConfig=TransportConfiguration(name=slave-connector, factory=org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnectorFactory) ?trustStorePassword=****&port=61616&keyStorePassword=****&sslEnabled=true&host=slave.com&trustStorePath=/path/to/ssl/truststore.jks&keyStorePath=/path/to/ssl/keystore.jks
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Connector NettyConnector [host=slave.com, port=61616, httpEnabled=false, httpUpgradeEnabled=false, useServlet=false, servletPath=/messaging/ActiveMQServlet, sslEnabled=true, useNio=true] using native epoll
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client] AMQ211002: Started EPOLL Netty Connector version 4.1.51.Final to slave.com:61616
2021-05-07 13:54:32,358 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Remote destination: slave.com/123.123.123.123:61616
2021-05-07 13:54:32,358 DEBUG [org.apache.activemq.artemis.spi.core.remoting.ssl.SSLContextFactory] Creating SSL context with configuration
trustStorePassword=****
port=61616
keyStorePassword=****
sslEnabled=true
host=slave.com
trustStorePath=/path/to/ssl/truststore.jks
keyStorePath=/path/to/ssl/keystore.jks
2021-05-07 13:54:32,448 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Added ActiveMQClientChannelHandler to Channel with id = 77c078c2
2021-05-07 13:54:32,448 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Connector towards NettyConnector [host=slave.com, port=61616, httpEnabled=false, httpUpgradeEnabled=false, useServlet=false, servletPath=/messaging/ActiveMQServlet, sslEnabled=true, useNio=true] failed
This is what slave is spamming:
2021-05-07 14:06:53,177 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] trying to lock position: 1
2021-05-07 14:06:53,178 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] failed to lock position: 1
If I attempt to telnet from master node to slave node (same if I do it from slave to slave):
[root@master]# telnet slave.com 61616
Trying 123.123.123.123...
telnet: connect to address 123.123.123.123: Connection refused
However, if I attempt the same telnet in the only working cluster, I can successfully "connect" from master to slave...
Here is what I suspect:
Master acquires lock in instance/data/journal/server.lock
Master keeps trying to connect to slave server
The slave is unable to start, because it cannot acquire the same server.lock on the shared storage.
The master uses high CPU because it keeps trying so hard to connect to the slave, which is not running.
What am I doing wrong?
EDIT: This is what my NFS mounts look like (taken from the mount command):
some_server:/some_dir on /path/to/artemis/instance/data type nfs4 (rw,relatime,sync,vers=4.1,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=123.123.123.123,local_lock=none,addr=123.123.123.123)
It turns out the issue was in the broker.xml configuration. In static-connectors I had somehow decided to list only the "non-current server" (e.g. I have srv0 and srv1 - in srv0's broker.xml I only added the connector for srv1, and vice versa).
What it used to be (on 1st master node):
<cluster-connections>
<cluster-connection name="abc">
<connector-ref>srv0-connector</connector-ref>
<message-load-balancing>ON_DEMAND</message-load-balancing>
<max-hops>1</max-hops>
<static-connectors>
<connector-ref>srv1-connector</connector-ref>
</static-connectors>
</cluster-connection>
</cluster-connections>
How it is now (on 1st master node):
<cluster-connections>
<cluster-connection name="abc">
<connector-ref>srv0-connector</connector-ref>
<message-load-balancing>ON_DEMAND</message-load-balancing>
<max-hops>1</max-hops>
<static-connectors>
<connector-ref>srv0-connector</connector-ref>
<connector-ref>srv1-connector</connector-ref>
</static-connectors>
</cluster-connection>
</cluster-connections>
After listing all of the cluster's nodes, the CPU usage normalized and is now only ~1% on the active node. The issue is not related to the AMQ Artemis connection spamming or the file locks at all.
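For reference, connectors like srv0-connector and srv1-connector are defined in the <connectors> section of the same broker.xml, roughly like this (host names here are placeholders; a real shared-store setup would also carry the SSL parameters visible in the DEBUG log above) -
<connectors>
<connector name="srv0-connector">tcp://srv0:61616</connector>
<connector name="srv1-connector">tcp://srv1:61616</connector>
</connectors>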

Unable to get NiFi to work in cluster

Based on all the examples I've read, I thought setting up NiFi as a cluster would be easy. Apparently I can't get it to work. I'm using NiFi 1.5. I only have 2 hosts and pretended that there was a third, but NiFi is not starting as a cluster. These are the changes I've made to the config files.
state-management.xml file
<cluster-provider>
<id>zk-provider</id>
<class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
<property name="Connect String">etl-1:2181,etl-2:2181,etl-3:2181</property>
<property name="Root Node">/ssd/nifi-1.5.0</property>
<property name="Session Timeout">10 seconds</property>
<property name="Access Control">Open</property>
</cluster-provider>
zookeeper.properties file
dataDir=./state/zookeeper
server.1=etl-1:2888:3888
server.2=etl-2:2888:3888
server.3=etl-3:2888:3888
nifi.properties file
nifi.zookeeper.connect.string=etl-1:2181,etl-2:2181,etl-3:2181
nifi.zookeeper.root.node=/ssd/nifi-1.5.0
nifi.cluster.is.node=yes
nifi.cluster.node.address=etl-2 (this is set to etl-1 on the other node)
nifi.state.management.embedded.zookeeper.start=true
[nifi@etl-2 zookeeper]$ cat /ssd/nifi-1.5.0/state/zookeeper/myid
2
[nifi@etl-1 logs]$ cat /ssd/nifi-1.5.0/state/zookeeper/myid
1
I've updated logback.xml to DEBUG, but there are so many messages I can't seem to find out what's wrong. My best guess is that ZooKeeper is starting in local mode instead of cluster mode.
ls -l /ssd/nifi-1.5.0/state/
total 4
drwxrwxr-x. 18 nifi nifi 4096 Feb 23 18:49 local
drwxrwxr-x. 2 nifi nifi 18 Feb 23 15:26 zookeeper
I found the problem. I had nifi.cluster.is.node=yes instead of nifi.cluster.is.node=true.
I am trying to run NiFi as a 3-node cluster on Windows 10, with the same properties set in all 3 conf files as mentioned above, but it is still failing to start NiFi as a 3-node cluster.
Below are the settings in my files.
state-management.xml file
<cluster-provider>
<id>zk-provider</id>
<class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
<property name="Connect String">node1:2181,node2:2181,node3:2181</property>
<property name="Root Node">/nifi</property>
<property name="Session Timeout">10 seconds</property>
<property name="Access Control">Open</property>
</cluster-provider>
zookeeper.properties:
dataDir=./state/zookeeper
autopurge.snapRetainCount=30
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
nifi.properties:
nifi.web.http.host=node1
nifi.web.http.port=8080
nifi.cluster.is.node=true
nifi.cluster.node.address=node1
nifi.cluster.node.protocol.port=11122
nifi.zookeeper.connect.string=node1:2181,node2:2181,node3:2181
nifi.zookeeper.root.node=/nifi
In the other 2 nodes, the host names are set respectively.
But when I start the nodes, it displays the below error:
C:\nifi-1.11.0-bin\Node1\bin>run-nifi.bat
2020-02-11 20:29:20,237 INFO [main] org.apache.nifi.bootstrap.Command Starting Apache NiFi...
2020-02-11 20:29:20,238 INFO [main] org.apache.nifi.bootstrap.Command Working Directory: C:\NIFI-1~1.0-B\Node1
2020-02-11 20:29:20,239 INFO [main] org.apache.nifi.bootstrap.Command Command: C:\Program Files\Java\jdk1.8.0_241\bin\java.exe -classpath C:\NIFI-1~1.0-B\Node1.\conf;C:\NIFI-1~1.0-B\Node1.\lib\javax.servlet-api-3.1.0.jar;C:\NIFI-1~1.0-B\Node1.\lib\jcl-over-slf4j-1.7.26.jar;C:\NIFI-1~1.0-B\Node1.\lib\jetty-schemas-3.1.jar;C:\NIFI-1~1.0-B\Node1.\lib\jul-to-slf4j-1.7.26.jar;C:\NIFI-1~1.0-B\Node1.\lib\log4j-over-slf4j-1.7.26.jar;C:\NIFI-1~1.0-B\Node1.\lib\logback-classic-1.2.3.jar;C:\NIFI-1~1.0-B\Node1.\lib\logback-core-1.2.3.jar;C:\NIFI-1~1.0-B\Node1.\lib\nifi-api-1.11.0.jar;C:\NIFI-1~1.0-B\Node1.\lib\nifi-framework-api-1.11.0.jar;C:\NIFI-1~1.0-B\Node1.\lib\nifi-nar-utils-1.11.0.jar;C:\NIFI-1~1.0-B\Node1.\lib\nifi-properties-1.11.0.jar;C:\NIFI-1~1.0-B\Node1.\lib\nifi-runtime-1.11.0.jar;C:\NIFI-1~1.0-B\Node1.\lib\slf4j-api-1.7.26.jar -Dorg.apache.jasper.compiler.disablejsr199=true -Xmx512m -Xms512m -Djavax.security.auth.useSubjectCredsOnly=true -Djava.security.egd=file:/dev/urandom -Dsun.net.http.allowRestrictedHeaders=true -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true -Djava.protocol.handler.pkgs=sun.net.www.protocol -Dzookeeper.admin.enableServer=false -Dnifi.properties.file.path=C:\NIFI-1~1.0-B\Node1.\conf\nifi.properties -Dnifi.bootstrap.listen.port=53535 -Dapp=NiFi -Dorg.apache.nifi.bootstrap.config.log.dir=C:\NIFI-1~1.0-B\Node1\bin..\logs org.apache.nifi.NiFi
2020-02-11 20:29:20,554 WARN [main] org.apache.nifi.bootstrap.Command Failed to set permissions so that only the owner can read pid file C:\NIFI-1~1.0-B\Node1\bin..\run\nifi.pid; this may allows others to have access to the key needed to communicate with NiFi. Permissions should be changed so that only the owner can read this file
2020-02-11 20:29:20,561 WARN [main] org.apache.nifi.bootstrap.Command Failed to set permissions so that only the owner can read status file C:\NIFI-1~1.0-B\Node1\bin..\run\nifi.status; this may allows others to have access to the key needed to communicate with NiFi. Permissions should be changed so that only the owner can read this file
2020-02-11 20:29:20,573 INFO [main] org.apache.nifi.bootstrap.Command Launched Apache NiFi with Process ID 4284
C:\nifi-1.11.0-bin\Node1\bin>status-nifi.bat
20:31:57.265 [main] DEBUG org.apache.nifi.bootstrap.NotificationServiceManager - Found 0 service elements
20:31:57.271 [main] INFO org.apache.nifi.bootstrap.NotificationServiceManager - Successfully loaded the following 0 services: []
20:31:57.272 [main] INFO org.apache.nifi.bootstrap.RunNiFi - Registered no Notification Services for Notification Type NIFI_STARTED
20:31:57.273 [main] INFO org.apache.nifi.bootstrap.RunNiFi - Registered no Notification Services for Notification Type NIFI_STOPPED
20:31:57.274 [main] INFO org.apache.nifi.bootstrap.RunNiFi - Registered no Notification Services for Notification Type NIFI_DIED
20:31:57.277 [main] DEBUG org.apache.nifi.bootstrap.Command - Status File: C:\NIFI-1~1.0-B\Node1\bin..\run\nifi.status
20:31:57.278 [main] DEBUG org.apache.nifi.bootstrap.Command - Status File: C:\NIFI-1~1.0-B\Node1\bin..\run\nifi.status
20:31:57.279 [main] DEBUG org.apache.nifi.bootstrap.Command - Properties: {pid=4284, port=53536}
20:31:57.280 [main] DEBUG org.apache.nifi.bootstrap.Command - Pinging 53536
20:31:58.343 [main] DEBUG org.apache.nifi.bootstrap.Command - Process with PID 4284 is not running
20:31:58.345 [main] INFO org.apache.nifi.bootstrap.Command - Apache NiFi is not running

Is it possible to have 2 master nodes?

Is it possible to have 2 master nodes, one running the ResourceManager and one running the NodeManager, in a YARN environment?
I am running a 3-node cluster with a YARN configuration. It gives me the error below in the log file:
hadoop OpenJDK Server VM warning: You have loaded library /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
and some syslog entries like the ones below:
2014-12-09 11:36:16,138 INFO [Thread-65] org.apache.hadoop.ipc.Server: Stopping server on 55850
2014-12-09 11:36:16,143 INFO [IPC Server listener on 55850] org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 55850
2014-12-09 11:36:16,143 INFO [IPC Server Responder] org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2014-12-09 11:36:16,144 INFO [TaskHeartbeatHandler PingChecker] org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler thread interrupted
So, please give me a solution.

Hadoop: Datanode process killed

I am currently using Hadoop 2.0.3-alpha. After I was able to work perfectly with HDFS (copying files into HDFS, getting successful runs from an external framework, using the web frontend), the datanode process now stops after a while whenever I restart my VM. The namenode process and all YARN processes work without a problem. I installed Hadoop in a folder under an additional user, as I also still have Hadoop 0.2 installed, which worked fine too.
Taking a look at the log file of all datanode processes, I got the following information:
2013-04-11 16:23:50,475 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-04-11 16:24:17,451 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2013-04-11 16:24:23,276 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2013-04-11 16:24:23,279 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2013-04-11 16:24:23,480 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname is user-VirtualBox
2013-04-11 16:24:28,896 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened streaming server at /0.0.0.0:50010
2013-04-11 16:24:29,239 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s
2013-04-11 16:24:38,348 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2013-04-11 16:24:44,627 INFO org.apache.hadoop.http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2013-04-11 16:24:45,163 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context datanode
2013-04-11 16:24:45,164 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
2013-04-11 16:24:45,164 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2013-04-11 16:24:45,355 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 0.0.0.0:50075
2013-04-11 16:24:45,508 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dfs.webhdfs.enabled = false
2013-04-11 16:24:45,536 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50075
2013-04-11 16:24:45,576 INFO org.mortbay.log: jetty-6.1.26
2013-04-11 16:25:18,416 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50075
2013-04-11 16:25:42,670 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 50020
2013-04-11 16:25:44,955 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened IPC server at /0.0.0.0:50020
2013-04-11 16:25:45,483 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh request received for nameservices: null
2013-04-11 16:25:47,079 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting BPOfferServices for nameservices: <default>
2013-04-11 16:25:47,660 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (storage id unknown) service to localhost/127.0.0.1:8020 starting to offer service
2013-04-11 16:25:50,515 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2013-04-11 16:25:50,631 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2013-04-11 16:26:15,068 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/hadoop/workspace/hadoop_space/hadoop23/dfs/data/in_use.lock acquired by nodename 3099@user-VirtualBox
2013-04-11 16:26:15,720 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-474150866-127.0.1.1-1365686732002 (storage id DS-317990214-127.0.1.1-50010-1365505141363) service to localhost/127.0.0.1:8020
java.io.IOException: Incompatible clusterIDs in /home/hadoop/workspace/hadoop_space/hadoop23/dfs/data: namenode clusterID = CID-1745a89c-fb08-40f0-a14d-d37d01f199c3; datanode clusterID = CID-bb3547b0-03e4-4588-ac25-f0299ff81e4f
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:850)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:821)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
at java.lang.Thread.run(Thread.java:722)
2013-04-11 16:26:16,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-474150866-127.0.1.1-1365686732002 (storage id DS-317990214-127.0.1.1-50010-1365505141363) service to localhost/127.0.0.1:8020
2013-04-11 16:26:16,276 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-474150866-127.0.1.1-1365686732002 (storage id DS-317990214-127.0.1.1-50010-1365505141363)
2013-04-11 16:26:18,396 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2013-04-11 16:26:18,940 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2013-04-11 16:26:19,668 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at user-VirtualBox/127.0.1.1
************************************************************/
Any ideas? Maybe I made a mistake during the installation process? But it is strange that it worked once. I also have to say that if I am logged in as my additional user to execute the commands ./hadoop-daemon.sh start namenode (and the same for the datanode), I need to add sudo.
I used this installation guide: http://jugnu-life.blogspot.ie/2012/0...rial-023x.html
By the way, I use Oracle Java 7.
The problem could be that the namenode was formatted after the cluster was set up and the datanodes were not, so the slaves are still referring to the old namenode.
We have to delete and recreate the folder /home/hadoop/dfs/data on the local filesystem for the datanode.
Check your hdfs-site.xml file to see where dfs.data.dir is pointing, delete that folder, and then restart the datanode daemon on the machine.
The steps above should recreate the folder and resolve the problem.
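The entry to look for in hdfs-site.xml is something like this (the property name as referenced above; newer Hadoop 2.x configs call it dfs.datanode.data.dir, and the path is just the example mentioned earlier) -
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/dfs/data</value>
</property>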
Please share your config info if the instructions above do not work.
The DataNode dies because of incompatible clusterIDs. To fix this problem:
If you are using Hadoop 2.x, you have to delete everything in the folder that you have specified in hdfs-site.xml under "dfs.datanode.data.dir" (but NOT the folder itself).
The clusterID is maintained in that folder. Delete the contents and restart dfs.sh. This should work!
You need to delete both the
C:\hadoop\data\dfs\datanode and
C:\hadoop\data\dfs\namenode folders.
If you don't have these folders, open your C:\hadoop\etc\hadoop\hdfs-site.xml file and get the paths to these folders for deletion. For me it says:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/hadoop/data/dfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/hadoop/data/dfs/datanode</value>
</property>
Run the command to format the namenode:
c:\hadoop\bin>hdfs namenode -format
Now it should work!
I think the recommended way of doing this without deleting the data directory is to simply change the clusterID variable in the datanode's VERSION file.
If you look in your daemons directory, you will see the datanode directory, for example:
data/hadoop/daemons/datanode
The VERSION file should look like this.
cat current/VERSION
#Tue Oct 14 17:31:58 CDT 2014
storageID=DS-23bf7f3a-085c-4531-808f-801ff6d52d14
clusterID=CID-bb3547b0-03e4-4588-ac25-f0299ff81e4f
cTime=0
datanodeUuid=63154929-ae68-4149-9f75-9a6558545041
storageType=DATA_NODE
layoutVersion=-55
You need to change the clusterID to the first value in the error message (the namenode's clusterID), so in your case that would be CID-1745a89c-fb08-40f0-a14d-d37d01f199c3 instead of CID-bb3547b0-03e4-4588-ac25-f0299ff81e4f.
The updated VERSION file should appear like this, with the altered clusterID:
cat current/VERSION
#Tue Oct 14 17:31:58 CDT 2014
storageID=DS-23bf7f3a-085c-4531-808f-801ff6d52d14
clusterID=CID-1745a89c-fb08-40f0-a14d-d37d01f199c3
cTime=0
datanodeUuid=63154929-ae68-4149-9f75-9a6558545041
storageType=DATA_NODE
layoutVersion=-55
Restart Hadoop and the datanode should start just fine.
