I want to add a replica of our whole eDirectory tree to a new server (OES11.2 SLES11.3).
So I wanted to do so via iManager. (Partitions and Replicas / Replica View / Add Replica)
Everything looks normal. I see our other servers with their replicas and, of course, the server holding the master replica.
For additional information: I have done this many times before without problems.
When I try to add a replica to the new server, I get the following error: (Error -636) The server is unreachable.
I checked the /etc/hosts file and the network settings on both servers.
Ndsrepair looks normal too. All servers are in sync and there are no connection errors. The replica depth of the new server is -1. I get that, because there is no replica on it yet.
But if I can connect from one server to another and there are no error messages, why does adding a replica not work?
I also tried to make a LAN trace, but didn't get any information that would help me out here. In the trace the communication seems normal!
Am I forgetting something here?
Every server in our environment runs OES11.2 except the master server which runs OES11.1
Thanks for your help!
Daniel
Nothing wrong.
Error -636 means that the replica is not yet available on the new server. Once synchronization completes, the replica will be ready and available. Depending on the size of the tree and the speed of the link, this can take up to several hours.
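If you want to rule out a plain connectivity problem while you wait, a quick TCP check of the NCP port (524, which eDirectory servers use to talk to each other) can be run from each server towards the other. This is only a minimal sketch: the function name is made up for illustration, and a successful connect does not prove that eDirectory itself is healthy.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call (hypothetical host name, replace with your own server):
#   tcp_reachable("new-server.example.com", 524)
```

If this returns False in either direction, look at firewalls and name resolution before suspecting the replica operation itself.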
I have an issue where I have multiple host dashboards for the same Elasticsearch server. Each dashboard has its own name and its own way of collecting data. One is connected to the installed datadog-agent and the other is somehow connected to the Elasticsearch service directly.
The weird thing is that I cannot find a way to turn off the agent connected directly to the ES service, other than turning off the Elasticsearch service completely.
I have tried removing the datadog-agent completely. This stops the dashboard connected to it from receiving data (of course), but the other dashboard somehow keeps receiving data. I cannot find what is sending this data and therefore cannot stop it. We have multiple master and data nodes, and this is an issue for all of them. The ES version is 7.17.
Another of our clusters is running ES 6.8. We have not finalized the monitoring configuration for that cluster, but for now it does not have this issue.
Just as extra information:
The dashboard connected to the agent has the same name as the host server, while the other only has the internal IP as its host name.
Does anyone have any idea what is running and how to stop it? I have tried almost everything I could think of.
I finally found the reason. All datadog-agents on all master and data nodes were configured not to use the node name as the host name, and cluster stats was turned on in the Elasticsearch integration for Datadog. As a result, whenever even one datadog-agent in the cluster was running, data kept coming in to the incorrectly named dashboard. Leaving the answer here in case anyone hits the same situation in the future.
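For anyone who wants to check their own agents for the same combination, here is a rough sketch. It assumes the instances from the Elasticsearch integration config have already been loaded into Python dicts, and that the two options involved are `cluster_stats` and `node_name_as_host`, both treated as off when missing. The function name is hypothetical.

```python
def cross_node_reporters(instances):
    """Return the instances that collect cluster-wide stats while not
    using the node name as the host name (the combination described above)."""
    risky = []
    for inst in instances:
        # Assumption: both options count as disabled when they are absent.
        cluster_stats = bool(inst.get("cluster_stats", False))
        node_name_as_host = bool(inst.get("node_name_as_host", False))
        if cluster_stats and not node_name_as_host:
            risky.append(inst)
    return risky
```

Any instance this returns is a candidate for the agent that keeps feeding the IP-named dashboard.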
I'm having an issue setting up my cluster according to the documents, as seen here: https://docs.sensu.io/sensu-go/5.5/guides/clustering/
This is a non-HTTPS setup to get my feet wet; I'm not concerned with TLS at the moment. I just want a running cluster to begin with.
I've set up sensu-backend on my three nodes, and have configured the backend configuration (backend.yml) accordingly on all three nodes through an ansible playbook. However, my cluster does not discover the other two nodes. It simply shows the following:
For backend1:
=== Etcd Cluster ID: 3b0efc7b379f89be
ID Name Peer URLs Client URLs
────────────────── ─────────────────── ─────────────────────── ───────────────────────
8927110dc66458af backend1 http://127.0.0.1:2380 http://localhost:2379
For backend2 and backend3, it's the same, except it shows those individual nodes as the only nodes in their cluster.
I've tried both the configuration in the docs, as well as the configuration in this git issue: https://github.com/sensu/sensu-go/issues/1890
None of these have panned out for me. I've ensured all the ports are open, so that's not an issue.
When I do a manual sensuctl cluster member-add X X, I get an error message and it results in the sensu-backend process failing. I can't remove the member, either, because it causes the entire process to not be able to start. I have to revert to an earlier snapshot to fix it.
The configs on all machines are the same, except that the IPs and names are adjusted for each machine:
etcd-advertise-client-urls: "http://XX.XX.XX.20:2379"
etcd-listen-client-urls: "http://XX.XX.XX.20:2379"
etcd-listen-peer-urls: "http://0.0.0.0:2380"
etcd-initial-cluster: "backend1=http://XX.XX.XX.20:2380,backend2=http://XX.XX.XX.31:2380,backend3=http://XX.XX.XX.32:2380"
etcd-initial-advertise-peer-urls: "http://XX.XX.XX.20:2380"
etcd-initial-cluster-state: "new" # have also tried existing
etcd-initial-cluster-token: ""
etcd-name: "backend1"
Did you find the answer to your question? I saw that you posted over on the Sensu forums as well.
In any case, the easiest thing to do in this case would be to stop the cluster, blow out /var/lib/sensu/sensu-backend/etcd/ and reconfigure the cluster. As it stands, the behavior you're seeing seems like the cluster members were started individually first, which is what is potentially causing the issue and would be the reason for blowing the etcd dir away.
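Before wiping the etcd directory and starting over, it may be worth sanity-checking each node's settings. The sketch below assumes the backend.yml values have been loaded into a plain dict, and checks one thing etcd insists on: the entry for `etcd-name` inside `etcd-initial-cluster` must match `etcd-initial-advertise-peer-urls`. The function names are made up for illustration.

```python
from typing import Dict, List


def parse_initial_cluster(value: str) -> Dict[str, str]:
    """Split 'name1=url1,name2=url2' into a {name: url} mapping."""
    members = {}
    for entry in value.split(","):
        name, _, url = entry.strip().partition("=")
        members[name] = url
    return members


def check_backend_config(cfg: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    members = parse_initial_cluster(cfg["etcd-initial-cluster"])
    name = cfg["etcd-name"]
    advertised = cfg["etcd-initial-advertise-peer-urls"]
    if name not in members:
        problems.append(f"{name} is not listed in etcd-initial-cluster")
    elif members[name] != advertised:
        problems.append(
            f"{name} is listed as {members[name]} but advertises {advertised}"
        )
    return problems
```

An empty result only means this node's settings are consistent with each other. It says nothing about whether an existing etcd data directory was initialized earlier with different values, which is the case described in the answer above.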
Hello Kafka/Zookeeper users,
My team has a Kafka cluster that works in conjunction with Apache ZooKeeper. Kafka is hosted on EC2. For any number of reasons, an EC2 host can go down and be replaced by a new host. The new host has a different broker id from the previous one (the id is generated by AWS, not by us).
At this point, ZooKeeper still holds the old state, in which the previous host was a replica for some partitions.
Although leader re-election happened successfully, the new replacement host was not used in any way, either as leader or as replica.
The Kafka documentation talks about a broker 'coming up again' after some time, but in the EC2 world the host is permanently replaced.
In distributed systems terminology we only attempt to handle a "fail/recover" model of failures where nodes suddenly cease working and then later recover (perhaps without knowing that they have died).
I understand the reason for that. ZooKeeper contains the state of each partition. That state lists the old dead host as leader and/or follower. When the new host comes up, this state is not updated to include it until we manually run a command to set the replicas.
Is there a way for Kafka to automatically use the new broker as leader or in the ISR?
This puts a lot of operational burden on our team, who must manually assign the new broker as a replica and trigger a 'preferred leader election'.
Preferred leader election can be triggered automatically by turning on config auto.leader.rebalance.enable and tuning leader.imbalance.per.broker.percentage.
However, the problem you are facing is that:
new servers will not automatically be assigned any existing data partitions, so unless partitions are moved to them they won't be doing any work until new topics are created.
It seems you have to come up with a scheme that automatically runs the kafka-reassign-partitions.sh script whenever a replacement occurs. No fully automatic scheme is offered out of the box.
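One possible building block for such a scheme is a small script that produces the reassignment JSON for you. The sketch below assumes you already have the current assignment in the same JSON layout that kafka-reassign-partitions.sh works with (a "version" field and a list of "partitions", each with "topic", "partition" and "replicas"), and simply swaps the dead broker id for the new one. The function names are hypothetical, and how you detect the replacement is left out.

```python
import json


def replace_broker(assignment, dead_id, new_id):
    """Build a reassignment plan in which dead_id is swapped for new_id.

    Only partitions that actually had dead_id as a replica are included.
    The input assignment is not modified.
    """
    plan = {"version": 1, "partitions": []}
    for part in assignment["partitions"]:
        replicas = part["replicas"]
        if dead_id not in replicas:
            continue
        if new_id in replicas:
            # Swapping would list the same broker twice for one partition.
            raise ValueError(
                f"broker {new_id} already holds {part['topic']}-{part['partition']}"
            )
        plan["partitions"].append({
            "topic": part["topic"],
            "partition": part["partition"],
            # Keep the replica order so the preferred leader slot is preserved.
            "replicas": [new_id if b == dead_id else b for b in replicas],
        })
    return plan


def plan_to_json(plan):
    """Serialize the plan so it can be written to a file."""
    return json.dumps(plan, indent=2)
```

The resulting file can then be passed to kafka-reassign-partitions.sh with --reassignment-json-file and --execute. Because the new broker takes the dead broker's position in each replica list, a preferred leader election afterwards should hand it the same leaderships the old broker had.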
Setup:
We're using AppFabric 1.1 on Windows 2008 Enterprise Edition VMs.
We set up a cluster with three nodes, using SQL Server for cluster configuration. We also use offloading, so SQL Server is supposed to do the cluster management; to make sure of that, we created the cluster with: New-AFCacheCluster -Offloading true. We then added the three nodes and started the cluster. All is good.
We then set up a single cache instance, call it "Test", with HA using the -Secondaries 1 option.
Test Scenario:
We then use a test app to put some test data into the cache and access that data and everything is working great. So then we go to the VM host and down the NIC for one of the nodes in the cluster to simulate that node's failure.
Results:
As soon as the NIC is disabled on the one node, when we go to read from the cache we get timeouts instead of a clean failover.
If we go run Get-AFCacheHostStatus on either of the other two hosts that are still up, the first time after the NIC is disabled, this call will take a very long time to return the status of the hosts. Once it finally does return status, it shows the node on which we yanked the NIC as being in UNKNOWN status. Subsequent calls to Get-AFCacheHostStatus will return quickly, but always showing the error message that the one node is unreachable and shows it in the UNKNOWN status.
OK, so AF itself detects that the node is in UNKNOWN status, but the test app is still getting timeouts at this point. Some minutes later, somewhere between 5 and 10 minutes, the app eventually starts working again with only the two nodes we have left.
Sooo, what's going on here? Are we configuring something incorrectly? Why is the cluster taking so long to recover from this basic kind of failure?
I get this exception after I have not communicated with HBase for a while:
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region because: Connection refused
Is this related to session expiry? If so, how can I extend the session lifetime?
Run bin/hbase hbck and find out on which machine the region server hosting -ROOT- is running.
hbck should report that -ROOT- is okay. Make sure that all your region servers are up and running.
If one of them is down, start it (for example with bin/hbase-daemon.sh start regionserver).
I don't think this has anything to do with session lifetime.
Check your cluster to make sure that it is up and working correctly and all region servers are alive. Then check the logs to make sure that they are not reporting some error state.
HBase is complex software -- without more detailed information it is very difficult to diagnose what is going on. And often you can discover the problem by collecting the more detailed information.
This error shows that the client is not able to talk to the region server.
Check the region server associated with the region it is trying to connect to, and make sure it is up.
To identify the region server associated with the region, see http://hbase.apache.org/0.94/book/regions.arch.html#regions.arch.assignment
Several factors play a role here.
Note the steps below, which occur when you try to connect to HBase from a client:
The client connects to ZooKeeper to get the IP of the region server that hosts the ROOT table.
The client caches this information so that it doesn't have to contact ZooKeeper again.
Your problem is that your client is trying to connect to ZooKeeper to get the IP. One of the following may be going wrong:
Your client is not able to connect to ZooKeeper.
The information about ROOT stored in the znode in ZooKeeper is wrong.
Possible fixes:
Check that your ZooKeeper is working fine.
Delete the znode for HBase in ZooKeeper and restart the cluster. Don't worry, this won't delete your data.
Once this is done, the client can get the ROOT information and then query the META table without any issue.
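For the "check ZooKeeper" step, one quick test is ZooKeeper's four-letter command ruok, which a healthy server answers with imok. Below is a minimal sketch using a raw socket. The function name is made up, port 2181 is only the usual default, and on newer ZooKeeper releases the four-letter commands must be whitelisted on the server before this works.

```python
import socket


def zookeeper_is_ok(host: str, port: int = 2181, timeout: float = 3.0) -> bool:
    """Send 'ruok' to a ZooKeeper server and return True if it replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # Read until the 4-byte answer has arrived or the server closes.
            while len(reply) < 4:
                data = sock.recv(4 - len(reply))
                if not data:
                    break
                reply += data
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False
    return reply == b"imok"
```

Run it from the client machine that throws the exception: a False result there, while the same check succeeds on the cluster nodes, points to the first cause in the list above (the client cannot reach ZooKeeper) rather than to a stale znode.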