Cloudant : Error with running weatherreport to check cluster health - cluster-computing

We have a three-node cluster and are facing an issue running the weatherreport command.
From the error, it is clear that the machine where the weatherreport utility runs cannot connect to the other two machines. I have checked all machines and they are reachable by FQDN, but the message suggests that the short name is used when connecting to a peer machine. How can I check where the peer machine names are taken from, so that I can try changing them to fully qualified names? If there is any other solution, please let us know.
The error is:
['cloudant_diag17506@machine2031.domain.com'] [crit] Could not run check weatherreport_check_safe_to_rebuild on cluster node 'cloudant@machine2031'
['cloudant_diag17506@machine2031.domain.com'] [crit] Could not run check weatherreport_check_safe_to_rebuild on cluster node 'cloudant@machine2032'
['cloudant_diag17506@machine2031.domain.com'] [crit] Could not run check weatherreport_check_safe_to_rebuild on cluster node 'cloudant@machine2033'
['cloudant@machine2032.domain.com'] [crit] Rebuilding this node will leave the following shard with NO live copies: default/t_alpha e0000000-ffffffff, default/t_alpha a0000000-bfffffff, default/t_alpha 60000000-7fffffff, default/t_alpha 20000000-3fffffff, default/metrics_app e0000000-ffffffff, default/metrics_app a0000000-bfffffff, default/metrics_app 60000000-7fffffff, default/metrics_app 20000000-3fffffff

I found the solution to this problem.
The problem was that the short name was used when the database was first created, so the database is probably referring to the peer hosts by their short names.
Since the Cloudant Local installation is now in an inconsistent state, the way to make it consistent is to remove all the files under /srv/cloudant/ on all database nodes. This removes all default Cloudant databases. Then run the configure.sh script again on each node as before, now that "hostname -f" correctly outputs the fully qualified host name, and create your databases again.
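
Before re-running configure.sh, it may be worth confirming on each node that "hostname -f" really returns a fully qualified name. The following is a minimal sketch, assuming a name counts as fully qualified when it contains a dot with text on both sides. The is_fqdn and check_this_node helpers are hypothetical and not part of Cloudant Local.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the name has a dot with
# at least one character on each side (e.g. host.domain.com).
is_fqdn() {
    case "$1" in
        ?*.?*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Hypothetical helper: checks what "hostname -f" reports on this node.
check_this_node() {
    name=$(hostname -f) || return 1
    if is_fqdn "$name"; then
        echo "OK: $name"
    else
        echo "short name reported: $name" >&2
        return 1
    fi
}
```

Running check_this_node on every database node before wiping /srv/cloudant/ would show whether any node still reports a short name.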

Related

Cannot find Datadog agent connected to Elasticsearch

I have an issue where I have multiple host dashboards for the same Elasticsearch server. Each dashboard has its own name and way of collecting data. One is connected to the installed datadog-agent and the other is somehow connected to the Elasticsearch service directly.
The weird thing is that I cannot find a way to turn off the agent connected directly to the ES service, other than turning off the Elasticsearch service completely.
I have tried removing the datadog-agent completely. The dashboard connected to it stops receiving data (of course), but the other dashboard somehow keeps receiving data. I cannot find what is sending this data and therefore cannot stop it. We have multiple master and data nodes, and this is an issue for all of them. The ES version is 7.17.
Another of our clusters is running ES 6.8. We have not finished configuring monitoring for that cluster, but for now it does not have this issue.
Just as extra information:
The dashboard connected to the agent has the same name as the host server, while the other only has the internal IP as its host name.
Does anyone have any idea what is running and how to stop it? I have tried almost everything I could think of.
I finally found the reason. The datadog-agents on all master and data nodes were configured not to use the node name as the host name, and cluster stats was turned on for the Elastic plugin for Datadog. As a result, as long as even one datadog-agent in the cluster was running, data kept coming in to the incorrectly named dashboard. Leaving the answer here in case anyone hits the same situation in the future.

Starting a node in emqtt and creating cluster

I am new to emqtt and Erlang. Using the documentation provided at emqtt.io, I configured emqtt on my machine and wanted to create a cluster.
I followed the steps below to create the nodes:
erl -name node1@127.0.0.1
erl -name node2@127.0.0.1
To connect these nodes I used the command below:
(node1@127.0.0.1)1> net_kernel:connect_node('node2@127.0.0.1')
I am not getting any response (true or false) after executing this command.
I also tried the following command:
./bin/emqttd_ctl cluster emqttd@192.168.0.10
but got a failure message:
Failed to join the cluster: {node_down,'node1@127.0.0.1'}
When I hit the URL localhost:8080/status I get the following message:
Node emq@127.0.0.1 is started
emqttd is running
But I couldn't get any details about the cluster.
Am I following the right steps? I need help creating a cluster in emqtt.
Thanks in advance!!
For each node created on a machine, a separate process is started. Creating many nodes eventually uses up most of the memory, which leads to a situation where you cannot join any nodes into a cluster. Hence, before joining, stop the nodes that are not in use with the ./emqttd stop command.
You need two emqx nodes running on different machines, as their ports may conflict with each other on the same machine.
Also, the node names MUST NOT use the loopback IP address 127.0.0.1, as in node1@127.0.0.1.
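
Erlang node names take the form name@host, so the loopback rule can be checked before trying to join a cluster. This is a minimal sketch only. The node_name_ok helper is hypothetical, is not part of emqttd, and checks just the 127.x.x.x range and the literal name localhost.

```shell
#!/bin/sh
# Hypothetical helper: fails if the host part of an Erlang node name
# (the text after "@") is a loopback address or "localhost".
node_name_ok() {
    host=${1#*@}
    case "$host" in
        127.*|localhost) return 1 ;;
        *)               return 0 ;;
    esac
}
```

For example, node_name_ok emqttd@192.168.0.10 succeeds, while node_name_ok node1@127.0.0.1 fails.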

Running multiple mesos slaves locally

I'm trying to run a test cluster locally following this guide https://mesosphere.com/2014/07/07/installing-mesos-on-your-mac-with-homebrew/
Currently, I'm able to have a master running at localhost:5050 and a slave running at the default port 5051 (with slave id, say, S0). However, when I tried to start another slave on a different port, it re-registered itself as S0 and the master console only showed 1 activated slave. Does anybody know how I would start another slave S1? Thanks!
Did you specify another work_dir?
E.g.
sudo /usr/local/sbin/mesos-slave --master=localhost:5050 --port=5052 --work_dir=/tmp/mesos2
To explain a bit why this is needed and where the error you saw came from:
Mesos supports so-called slave recovery to help with upgrades and error recovery.
Therefore, when a slave starts, it checks its work_dir for a checkpoint and tries to recover that state (i.e. reconnect to still-running executors).
In your case, as both slaves wanted to start from the same working directory, the second one tried to recover the checkpoint of the still-running first slave...
P.S. I should probably replace all the above occurrences of slave with worker (https://issues.apache.org/jira/browse/MESOS-1478), but I hope this is easier to read.
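
To make the "one port and one work_dir per slave" rule concrete, here is a minimal dry-run sketch that only prints the command lines instead of launching anything. The binary path and the --master, --port and --work_dir flags come from the answer above. The agent_cmd helper and the /tmp/mesos-agent-N directory naming are assumptions made for illustration.

```shell
#!/bin/sh
# Print the command line for the n-th local slave (n = 0, 1, 2, ...).
# Ports start at the default 5051; each slave gets its own work_dir
# so it never tries to recover another slave's checkpoint.
agent_cmd() {
    n=$1
    port=$((5051 + n))
    echo "/usr/local/sbin/mesos-slave --master=localhost:5050 --port=${port} --work_dir=/tmp/mesos-agent-${n}"
}

# Dry run: show the commands for two slaves.
for i in 0 1; do
    agent_cmd "$i"
done
```

Each printed line can then be run in its own terminal (with sudo if your setup needs it), and the master console should list both slaves under separate ids.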

Datastax Opscenter issue: dashboard timeout

I installed the DataStax community version on an EC2 server and it worked fine. After that I tried to add one more server. I see two nodes in the Nodes menu, but in the main dashboard I see the following error:
Error: Call to /Test_Cluster__No_AMI_Parameters/rc/dashboard_presets/ timed out.
One potential root cause I can see is the name of the cluster. I specified something else in cassandra.yaml, but it looks like OpsCenter is still using the original name. Any help would be greatly appreciated.
It was because the cluster name change wasn't made properly. I found it easier to change the cluster name before starting the Cassandra cluster. On top of this, only one instance of opscenterd needs to run per cluster. datastax-agent needs to be running on all nodes in the cluster, but they all need to point to the same opscenterd (the change needs to be made in /var/lib/datastax-agent/conf/address.yaml).
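
As an illustration of the address.yaml change, the sketch below prints the line that would point an agent at opscenterd. It assumes stomp_interface is the agent setting that holds the opscenterd address. The agent_address_line helper is hypothetical and 10.0.0.5 is a placeholder IP, not a value from the question.

```shell
#!/bin/sh
# Hypothetical helper: print the address.yaml line that points a
# datastax-agent at the given opscenterd address.
agent_address_line() {
    printf 'stomp_interface: %s\n' "$1"
}

# Placeholder address; replace it with the host running opscenterd.
agent_address_line 10.0.0.5
```

The same line would go into /var/lib/datastax-agent/conf/address.yaml on every node, followed by a restart of the agent.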

HBase NoServerForRegionException?

I get this exception when I have not communicated with HBase for a while:
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region because: Connection refused
Is this related to session expiry? If so, how can I extend the session lifetime?
Run bin/hbase hbck and find out on which machine the root region server is running.
hbck should report that -ROOT- is okay. Make sure that all your
region servers are up and running.
Use start regionserver to start a region server.
I don't think this has anything to do with session lifetime.
Check your cluster to make sure that it is up and working correctly and all region servers are alive. Then check the logs to make sure that they are not reporting some error state.
HBase is complex software -- without more detailed information it is very difficult to diagnose what is going on. And often you can discover the problem by collecting the more detailed information.
This error shows that the client is not able to talk to the region server.
Check the region server associated with the region it is trying to connect to and make sure it is up.
To identify the region server associated with the region, see http://hbase.apache.org/0.94/book/regions.arch.html#regions.arch.assignment
Several factors may have played a role here.
Note the steps below, which occur when you connect to HBase from a client:
The HBase client connects to ZooKeeper to get the IP of the region server that hosts the ROOT table.
The client caches this IP information so that it doesn't have to contact ZooKeeper again.
Your problem is that your client is trying to connect to ZooKeeper to get the IP, and one of the following may be going wrong:
Your client is not able to connect to ZooKeeper.
The information about ROOT stored in the znode in ZooKeeper is wrong.
Possible fixes:
Check that your ZooKeeper is working fine.
Delete the znode for HBase in your ZooKeeper and restart the cluster. Don't worry, this won't delete your data.
Once this is done, the client can get the ROOT information and then query the META table without any issue.
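
One way to check that ZooKeeper is working is its "ruok" four-letter command, which a healthy server answers with "imok". This is a minimal sketch, assuming nc (netcat) is installed and the default client port 2181. The zk_is_ok helper is hypothetical, and newer ZooKeeper releases may need "ruok" enabled through the 4lw.commands.whitelist setting before they will answer.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if ZooKeeper at host:port answers
# "imok" to the "ruok" four-letter command within 2 seconds.
zk_is_ok() {
    host=$1
    port=${2:-2181}
    reply=$(printf 'ruok' | nc -w 2 "$host" "$port" 2>/dev/null)
    [ "$reply" = "imok" ]
}
```

For example, zk_is_ok zk1.example.com && echo "ZooKeeper is answering", where zk1.example.com stands in for one of your ZooKeeper hosts.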
