The error about “Report all chunks to master with exception [RemoteRun[master]] Unrecognized column……” - cluster-computing

(1) There is a cluster consisting of two machines, A and B. Machine A hosts the controller node, and each machine hosts one data node.
(2) After shutting down and upgrading the DolphinDB Server on machine A, its controller node, agent node and data node are restarted and the data node works normally.
The DolphinDB Server on machine B has not been upgraded. After I restart its agent node and data node, the data node cannot be initialized and the following error is reported all the time:
Report all chunks to master with exception [RemoteRun[master]] Unrecognized column……
How can I resolve this issue?

The current versions (1.30.13, 2.0.1, 1.20.22) do not support rolling upgrades, so physical machines A and B need to be shut down at the same time, the DolphinDB Server upgraded on both machines, and then the controller node (Controller), agent node (Agent), and data nodes (Datanode) of both machines restarted.
So the solution for the above situation is:
(1) Shut down the entire cluster
(2) Complete the DolphinDB Server upgrade on physical machine B
(3) Restart the controller node (Controller), agent node (Agent), and data nodes (Datanode) of physical machines A and B
A subsequent version of DolphinDB will support rolling upgrades to avoid this situation.
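For reference, the full-cluster upgrade above looks roughly like the following shell sketch. It assumes the wrapper scripts (stopAllNode.sh, startController.sh, startAgent.sh) shipped with the standard DolphinDB cluster deployment package and uses placeholder paths; adapt script names and paths to your own installation.

# Step (1): on both machine A and machine B, stop every DolphinDB process
# (data nodes, agent, controller).
./stopAllNode.sh          # or stop the data nodes from the cluster web UI first
pkill -f dolphindb        # make sure no dolphindb process is left running

# Step (2): on machine B (and machine A, if not already done), replace the
# server binaries with the new version, keeping config, data and log dirs.
cp -r /path/to/new/server/* /opt/dolphindb/server/    # placeholder paths

# Step (3): restart the controller and agents, then the data nodes.
./startController.sh      # on machine A
./startAgent.sh           # on machine A and on machine B
# Finally start the data nodes of both machines from the controller's web UI.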

Related

Apache Ignite 2.7 to 2.10 upgrade: Server Node can not rejoin cluster

I have a 5-node Service Grid running on an Ignite 2.10.0 cluster. Testing the upgrade, I stop one Server Node (SIGTERM) and wait for it to rejoin. It fails to stay connected to the cluster.
Each node is a primary microservice provider and a backup for another (Cluster Singletons). The service that was running on the node that left the cluster is properly picked up by its backup node. However, the server node can not stay connected to the cluster ever again!
Rejoin strategy:
Let systemd restart ignite.
The node rejoins, but then the new Server Node invokes its shutdown hook
Go back to 1
I have no idea why the rejoined node shuts itself down. As far as I can tell, the Coordinator did not kill this youngest Server Node. I am logging with DEBUG and IGNITE_QUIET set to false; I still can't find anything in the logs.
I tried increasing network timeouts, but the newly re-joined node still shuts down.
Any idea what is going on or where to look?
Thanks in advance.
Greg
Environment:
RHEL 7.9, Java 11
Ignite configuration:
persistence is set to false.
clientReconnectDisabled is set to true.

If master node failed then how can recover all data on master node and how to again start hadoop cluster?

I have a three-node Hadoop cluster (master, slave1, slave2) managed by Ambari. My question is: if the master server fails, how can we recover? Do we need to add a new server and install Ambari again, or can we recover the data from the failed server? If we add a new server and assign it as the master, how do we do that? Could you suggest how to resolve a master server failure?
Thanks in advance.
There is no way to retrieve the data if the NameNode dies and you have no backup. You need a backup NameNode (aka the Secondary NameNode), which takes a metadata backup at a fixed interval. This interval is generally long, so you may still lose some data.
With Hadoop 2.0 you can take more frequent backups with the help of a passive (standby) NameNode, which becomes active if the main NameNode dies, so the data is still accessible.
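As a rough illustration of the Hadoop 2.x standby-NameNode setup mentioned above (nn1 and nn2 are hypothetical NameNode service IDs as they would appear in hdfs-site.xml):

# Check which NameNode is currently active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manual failover from nn1 to nn2 (with automatic failover via ZKFC this
# happens on its own when the active NameNode dies)
hdfs haadmin -failover nn1 nn2

# When rebuilding a dead NameNode host, copy the metadata from the active
# NameNode before starting the new standby
hdfs namenode -bootstrapStandby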

Hortonworks Data Platform: High load causes node restart

I have setup a Hadoop Cluster with Hortonworks Data Platform 2.5. I'm using 1 master and 5 slave (worker) nodes.
Every few days one (or more) of my worker nodes gets a high load and seems to restart the whole CentOS operating system automatically. After the restart the Hadoop components don't run anymore and have to be restarted manually via the Ambari management UI.
Here is a screenshot of the "crashed" node (rebooted after the high load value ~4 hours ago):
Here is a screenshot of one of the other "healthy" worker nodes (all other workers have similar values):
The crashes alternate between the 5 worker nodes; the master node seems to run without problems.
What could cause this problem? Where are these high load values coming from?
This seems to be a kernel problem, as the log file (e.g. /var/spool/abrt/vmcore-127.0.0.1-2017-06-26-12:27:34/backtrace) says something like:
Version: 3.10.0-327.el7.x86_64
BUG: unable to handle kernel NULL pointer dereference at 00000000000001a0
After running a sudo yum update I had the kernel version
[root@myhost ~]# uname -r
3.10.0-514.26.2.el7.x86_64
Since the operating system update, the problem hasn't occurred anymore. I will observe the issue and give feedback if necessary.
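For anyone hitting the same crash, the diagnosis and fix above can be reproduced roughly like this on CentOS 7 (output will of course differ per system):

# List kernel crash dumps collected by abrt (the backtrace quoted above
# lives under /var/spool/abrt/)
abrt-cli list

# Check the running kernel, then pull in the newer one
uname -r
sudo yum update kernel     # a full "sudo yum update" works as well
sudo reboot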

Understand Spark: Cluster Manager, Master and Driver nodes

Having read this question, I would like to ask additional questions:
The Cluster Manager is a long-running service; on which node does it run?
Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
In case the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e., how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
Similarly to the previous question: in case the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
1. The Cluster Manager is a long-running service; on which node does it run?
The Cluster Manager is the Master process in Spark standalone mode. It can be started on any node with ./sbin/start-master.sh; in YARN it is the ResourceManager.
2. Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
The Master is per cluster, and the Driver is per application. For standalone/YARN clusters, Spark currently supports two deploy modes.
In client mode, the driver is launched in the same process as the client that submits the application.
In cluster mode, however, for standalone the driver is launched on one of the Workers, and for YARN it is launched inside the application master node; the client process exits as soon as it fulfils its responsibility of submitting the application, without waiting for the app to finish.
If an application is submitted with --deploy-mode client on the Master node, both the Master and the Driver will be on the same node. Check the deployment of a Spark application over YARN.
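For illustration, a minimal spark-submit sketch of the two deploy modes on a standalone cluster (the master URL, class and jar names are placeholders):

# Client mode: the driver runs inside the spark-submit process itself
./bin/spark-submit --master spark://master-host:7077 \
  --deploy-mode client --class com.example.MyApp my-app.jar

# Cluster mode: the driver is launched on one of the Workers and
# spark-submit returns once the application has been submitted
./bin/spark-submit --master spark://master-host:7077 \
  --deploy-mode cluster --class com.example.MyApp my-app.jar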
3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e. how the Master node, Cluster Manager and Workers nodes will get involved (if they do), and in which order?
If the driver fails, all executor tasks will be killed for that submitted/triggered Spark application.
4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
Master node failures are handled in two ways.
Standby Masters with ZooKeeper:
Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications – applications that were already running during Master failover are unaffected. Check here for configurations.
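A minimal sketch of that configuration, assuming it is passed to each Master via SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh (the ZooKeeper hosts and the znode directory are placeholders):

# In conf/spark-env.sh on every machine that may run a Master:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# Then start a Master on each candidate machine:
./sbin/start-master.sh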
Single-Node Recovery with Local File System:
ZooKeeper is the best way to go for production-level high availability, but if you want to be able to restart the Master if it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master process. Check here for conf and more details.
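A similarly hedged sketch for FILESYSTEM mode (the recovery directory is a placeholder):

# In conf/spark-env.sh on the single Master machine:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
  -Dspark.deploy.recoveryDirectory=/var/spark/recovery"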
The Cluster Manager is a long-running service; on which node does it run?
A cluster manager is just a manager of resources, i.e. CPUs and RAM, that SchedulerBackends use to launch tasks.
A cluster manager does nothing more for Apache Spark than offer resources, and once Spark executors launch, they communicate directly with the driver to run tasks.
You can start a standalone master server by executing:
./sbin/start-master.sh
It can be started anywhere.
To run an application on the Spark cluster:
./bin/spark-shell --master spark://IP:PORT
Is it possible that the Master and the Driver nodes will be the same machine?
I presume that there should be a rule somewhere stating that these two nodes should be different?
In standalone mode, when you start your machines, certain JVMs will start. Your Spark Master will start up, and on each machine a Worker JVM will start and register with the Spark Master.
Both are the resource managers. When you start your application, or submit your application in cluster mode, a Driver will start up wherever you ssh in to start that application.
The Driver JVM will contact the Spark Master for executors (Ex), and in standalone mode the Worker will start the Ex.
So the Spark Master is per cluster and the Driver JVM is per application.
In case the Driver node fails, who is responsible for re-launching the application? And what will happen exactly?
i.e., how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
If an Ex JVM crashes, the Worker JVM will restart the Ex, and when a Worker JVM crashes, the Spark Master will restart it.
And with a Spark standalone cluster in cluster deploy mode, you can also specify --supervise to make sure that the driver is automatically restarted if it fails with a non-zero exit code; the Spark Master will start the Driver JVM.
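A minimal sketch of that flag (master URL, class and jar are placeholders; --supervise applies to the standalone cluster deploy mode):

./bin/spark-submit --master spark://master-host:7077 \
  --deploy-mode cluster --supervise \
  --class com.example.MyApp my-app.jar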
Similarly to the previous question: in case the Master node fails, what will happen exactly and who is responsible for recovering from the failure?
Failure of the master will result in executors not being able to communicate with it. So, they will stop working. Failure of the master will also make the driver unable to communicate with it for job status. So, your application will fail.
Master loss will be acknowledged by the running applications, but otherwise they should continue to work more or less as if nothing happened, with two important exceptions:
1. The application won't be able to finish in an elegant way.
2. If the Spark Master is down, the Worker will try to reregisterWithMaster. If this fails multiple times, workers will simply give up.
reregisterWithMaster() re-registers with the active master this worker has been communicating with. If there is none, then it means this worker is still bootstrapping and hasn't established a connection with a master yet, in which case we should re-register with all masters.
It is important to re-register only with the active master during failures. If the worker unconditionally attempts to re-register with all masters, a race condition may arise; the error is detailed in SPARK-4592.
At this moment long-running applications won't be able to continue processing, but it still shouldn't result in immediate failure. Instead, the application will wait for a master to come back online (file-system recovery) or for contact from a new leader (ZooKeeper mode), and if that happens it will continue processing.

AppFabric Cache Cluster not detecting a node has failed in a timely fashion

Setup:
We're using AppFabric 1.1 on Windows 2008 Enterprise Edition VMs.
We set up a cluster with three nodes using SQL Server for cluster configuration and also for offloading, so SQL Server is supposed to do the cluster management; we made sure to create the cluster with New-AFCacheCluster -Offloading true. We then add the three nodes and start the cluster up. All is good.
We then set up a single cache instance, call it "Test", with HA using the -Secondaries 1 option.
Test Scenario:
We then use a test app to put some test data into the cache and access that data and everything is working great. So then we go to the VM host and down the NIC for one of the nodes in the cluster to simulate that node's failure.
Results:
As soon as the NIC is disabled on the one node, when we go to read from the cache we get timeouts instead of a clean failover.
If we go run Get-AFCacheHostStatus on either of the other two hosts that are still up, the first time after the NIC is disabled, this call will take a very long time to return the status of the hosts. Once it finally does return status, it shows the node on which we yanked the NIC as being in UNKNOWN status. Subsequent calls to Get-AFCacheHostStatus will return quickly, but always showing the error message that the one node is unreachable and shows it in the UNKNOWN status.
Ok, so AF itself detects that the node is in UNKNOWN status, but the test app is still getting timeouts at this point. Some minutes later, somewhere between 5 and 10 minutes, the app will eventually start working again with only the two nodes we have left.
So, what's going on here? Are we configuring something incorrectly? Why is the cluster taking so long to recover from this basic kind of failure?
