Detecting and recovering failed H2 cluster nodes - h2

After going through the H2 developer guide I still don't understand how I can find out which cluster node(s) failed and which database needs to be recovered in the event of a temporary network failure.
Let's consider the following scenario:
H2 cluster started with N active nodes (is it actually true that H2 can support N>2, i.e. more than 2 cluster nodes?)
(lots of DB updates, reads...)
Network connection with one (or several) cluster nodes goes down and the node(s) become invisible to the rest of the cluster
(lots of DB updates, reads...)
Network link with previously disconnected node(s) restored
It is discovered that a cluster node was probably missing (as far as I can see, SELECT VALUE FROM INFORMATION_SCHEMA.SETTINGS WHERE NAME='CLUSTER' starts responding with an empty string if one node in the cluster fails)
After this point it is unclear how to find out which nodes were failing.
Obviously, I can do some basic checks like comparing DB sizes, but that is unreliable.
What is the recommended procedure to find out which node was missing from the cluster, especially if the query above responds with an empty string?
Another question: why doesn't urlTarget support multiple parameters?
How am I supposed to use the CreateCluster tool if multiple nodes in the cluster failed and I want to recover more than one?
Also, I don't understand how CreateCluster works if I had to stop the cluster and I don't want to actually recover any nodes. What's not clear to me is what I need to pass to the CreateCluster tool if I don't actually need to copy the database.

That is partially right: SELECT VALUE FROM INFORMATION_SCHEMA.SETTINGS WHERE NAME='CLUSTER' will return an empty string when queried in standard mode.
However, you can also get the list of servers by using Connection.getClientInfo(), though it is a two-step process. Paraphrased from h2database.com:
The list of properties returned by getClientInfo() includes a numServers property that returns the number of servers that are in the connection list. getClientInfo() also has properties server0..serverN, where N is the number of servers - 1. So to get the 2nd server from the list you use getClientInfo('server1').
Note: The serverX property only returns IP addresses and ports, not hostnames.
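As an illustration, a minimal sketch of that two-step process over a clustered JDBC connection (the URL, hosts, ports, and credentials below are hypothetical; adjust them to your setup):

import java.sql.Connection;
import java.sql.DriverManager;

public class ListClusterServers {
    public static void main(String[] args) throws Exception {
        // Hypothetical clustered connection URL listing both nodes.
        Connection conn = DriverManager.getConnection(
                "jdbc:h2:tcp://server1:9101,server2:9102/~/test", "sa", "");
        // Step 1: how many servers are in the connection list?
        int numServers = Integer.parseInt(conn.getClientInfo("numServers"));
        // Step 2: read server0..server(numServers - 1); values are IP:port, not hostnames.
        for (int i = 0; i < numServers; i++) {
            System.out.println("server" + i + " = " + conn.getClientInfo("server" + i));
        }
        conn.close();
    }
}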
And before you say "simple replication": yes, that is the default mode of operation, but you can do more advanced things with clustered H2 that are outside the scope of your question.
Here's the quote for what you're talking about:
Clustering can only be used in the server mode (the embedded mode does not support clustering). The cluster can be re-created using the CreateCluster tool without stopping the remaining server. Applications that are still connected are automatically disconnected, however when appending ;AUTO_RECONNECT=TRUE, they will recover from that.
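For reference, re-creating the cluster with the CreateCluster tool looks roughly like this; the node URLs, ports, and credentials below are hypothetical, -urlSource points at the surviving node and -urlTarget at the node being restored (its database is overwritten by the copy):

// Equivalent command line:
//   java -cp h2-*.jar org.h2.tools.CreateCluster
//        -urlSource jdbc:h2:tcp://server1:9101/~/test
//        -urlTarget jdbc:h2:tcp://server2:9101/~/test
//        -user sa -serverList server1:9101,server2:9101
import org.h2.tools.CreateCluster;

public class RecoverClusterNode {
    public static void main(String[] args) throws Exception {
        CreateCluster.main(
                "-urlSource", "jdbc:h2:tcp://server1:9101/~/test", // surviving node (copy source)
                "-urlTarget", "jdbc:h2:tcp://server2:9101/~/test", // recovered node (copy target)
                "-user", "sa",
                "-serverList", "server1:9101,server2:9101");
    }
}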
So yes, if the cluster stops, AUTO_RECONNECT is not enabled, and you stick with the basic query, you are stuck and it is difficult to find information. While most people will tell you to look through the API and/or manual, they haven't had to look through this one, so: my sympathies.
I find it far more useful to track through the error codes, because you get a really good idea of what you can do when you see how each failure is planned for... here you go.

Related

Elasticsearch - two out of three nodes instant shutdown case

We have a small Elasticsearch cluster of 3 nodes: two in one datacenter and one in another for disaster recovery reasons. However, if the first two nodes fail simultaneously, the third one won't work either - it will just throw "master not discovered or elected yet".
I understand that this is intended - this is how an Elasticsearch cluster should work. But is there some additional special configuration that I don't know about to keep the third single node working, even if in read-only mode?
Nope, there's not. As you mentioned, it's designed that way.
You're probably not doing yourselves a lot of favours by running things across datacentres like that. Network issues are not kind to Elasticsearch due to its distributed nature.
Elasticsearch runs in distributed mode by default. Nodes assume that they are, or will be, part of a cluster, and during setup nodes try to automatically join the cluster.
If you want your Elasticsearch node to be available on its own, without needing to communicate with other Elasticsearch nodes, it works similarly to a standalone server. To do this, you can tell Elasticsearch to work in local-only mode (disable the network).
Open your elasticsearch/config/elasticsearch.yml and set:
node.local: true

Eventual consistency - how to avoid phantoms

I am new to the topic. Having read a handful of articles on it, and asked a couple of people, I still do not understand what you people do in regard to one problem.
There are UI clients making requests to several backend instances (for now it's irrelevant whether sessions are sticky or not), and those instances are connected to some highly available DB cluster (be it Cassandra or something else, even Elasticsearch). Say the backend instance is not specifically tied to one of the cluster's machines, and instead each of its requests to the DB may be served by a different machine.
One client creates some record; it's synchronously or asynchronously stored to one of the cluster's machines, then eventually gets replicated to the rest of the DB machines. Then another client requests the list of records; the request ends up served by a distant machine that has not yet received the replicated changes, and so the client does not see the record. Well, that's bad but not yet ugly.
Consider however that the second client hits the machine which has the record, displays it in a list, then refreshes the list and this time hits the distant machine and again does not see the record. That's very weird behavior to observe, isn't it? It might even get worse: the client successfully requests the record, starts some editing on it, then tries to store the updates to the DB and this time hits the distant machine which says "I know nothing about this record you are trying to update". That's an error which the user will see while doing something completely legitimate.
So what's the common practice to guard against this?
So far, I only see three solutions.
1) Not actually a solution but rather a policy: ignore the problem and instead speed up the cluster enough to guarantee that 99.999% of changes will be replicated across the whole cluster in, say, 0.5 seconds (it's hard to imagine some user will try to make several consecutive requests to one record in that time; he can of course issue several reading requests, but in that case he'll probably not notice inconsistency between results). And even if sometimes something goes wrong and the user faces the problem, well, we just embrace that. If the user gets unhappy and writes a complaint to us (which will happen maybe once a week or once an hour), we just apologize and move on.
2) Introduce an affinity between a user's session and a specific DB machine. This helps, but needs explicit support from the DB, hurts load-balancing, and invites complications when the DB machine goes down and the session needs to be re-bound to another machine (however, with proper support from the DB I think that's possible; say, Elasticsearch can accept a routing key, and I believe if the target shard goes down it will just switch the affinity link to another shard - though I am not entirely sure; but even if re-binding happens, the other machine may contain older data :) ).
3) Rely on monotonic consistency, i.e. some method to be sure that the next request from a client will get results no older than the previous one. But, as I understand it, this approach also requires explicit support from the DB, like being able to pass some "global version timestamp" to the cluster's balancer, which it will compare with its latest data on all machines' timestamps to determine which machines can serve the request.
Are there other good options? Or are those three considered good enough to use?
P.S. My specific problem right now is with Elasticsearch; AFAIK there is no support for monotonic reads there, though it looks like option #2 may be available.
Apache Ignite has a primary partition for a key and backup partitions. Unless you have the readFromBackup option set, you will always be reading from the primary partition, whose contents are expected to be reliable.
If a node goes away, a transaction (or operation) should be either propagated by remaining nodes or rolled back.
Note that Apache Ignite doesn't do Eventual Consistency but instead Strong Consistency. It means that you can observe delays during node loss, but will not observe inconsistent data.
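A minimal sketch of the readFromBackup flag mentioned above, using Ignite's Java API (the cache name, key, and value are hypothetical):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;

public class PrimaryReads {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<String, String> cfg = new CacheConfiguration<>("records");
            cfg.setBackups(1);             // keep one backup copy of each partition
            cfg.setReadFromBackup(false);  // serve reads from the primary partition only
            IgniteCache<String, String> cache = ignite.getOrCreateCache(cfg);
            cache.put("user-1", "some record");
            System.out.println(cache.get("user-1")); // read goes to the primary copy
        }
    }
}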
In Cassandra, if using at least quorum consistency for both reads and writes, you will get monotonic reads. This was not the case pre 1.0, but that's a long time ago. There are some gotchas if using server timestamps, but that's not the default, so it likely won't be an issue if using C* 2.1+.
What can get funny, since C* uses timestamps, is things that occur at the "same time". Since Cassandra is Last Write Wins, the times and clock drift do matter. But concurrent updates to records will always have race conditions, so if you require strong read-before-write guarantees you can use lightweight transactions (essentially CAS operations using Paxos) to ensure no one else updates between your read and your update; these are slow, though, so I would avoid them unless critical.
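A minimal sketch of both ideas with the DataStax Java driver (3.x-style API; the keyspace, table, and values are hypothetical): QUORUM on reads and writes for monotonic reads, and a lightweight transaction for the read-before-write case.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class QuorumExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");
        // QUORUM on both the write and the read gives monotonic reads.
        session.execute(new SimpleStatement(
                "INSERT INTO users (id, email) VALUES (?, ?)", "user-1", "a@example.com")
                .setConsistencyLevel(ConsistencyLevel.QUORUM));
        session.execute(new SimpleStatement(
                "SELECT email FROM users WHERE id = ?", "user-1")
                .setConsistencyLevel(ConsistencyLevel.QUORUM));
        // Lightweight transaction (Paxos-based CAS): only applies if the expected value still holds.
        session.execute(new SimpleStatement(
                "UPDATE users SET email = ? WHERE id = ? IF email = ?",
                "b@example.com", "user-1", "a@example.com"));
        cluster.close(); // also closes the session
    }
}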
In a true distributed system, it does not matter where your record is stored in the remote cluster as long as your clients are connected to that remote cluster. In Hazelcast, a record is always stored in a partition, and each partition is owned by one of the servers in the cluster. There can be X partitions in the cluster (by default 271) and all those partitions are equally distributed across the cluster. So a 3-member cluster will have a partition distribution like 91-90-90.
Now when a client sends a record to store in the Hazelcast cluster, it already knows which partition the record belongs to by using a consistent hashing algorithm. And with that, it also knows which server is the owner of that partition. Hence, the client sends its operation directly to that server. This approach applies to all client operations - put or get. So in your case, you may have several UI clients connected to the cluster, but your record for a particular user is stored on one server in the cluster and all your UI clients will be approaching that server for their operations related to that record.
As for consistency, Hazelcast by default is a strongly consistent distributed cache, which implies that all your updates to a particular record happen synchronously, in the same thread, and the application waits until it has received acknowledgement from the owner server (and the backup server if backups are enabled) in the cluster.
When you connect a DB layer (this could be one or many different types of DBs running in parallel) to the cluster, the Hazelcast cluster returns data even if it's not currently present in the cluster by reading it from the DB. So you never get a null value. On updating, you configure the cluster to send the updates downstream synchronously or asynchronously.
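A minimal sketch of that owner-routed behaviour from a client (the map name, key, and value are hypothetical; imports are in the Hazelcast 3.x style):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class OwnerRouting {
    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, String> records = client.getMap("records");
        // put() hashes the key, finds the owning partition/member, and blocks
        // until that owner (and its backup, if configured) acknowledges the write.
        records.put("user-1", "some record");
        // get() is routed to the same owner, so it sees the acknowledged value.
        System.out.println(records.get("user-1"));
        client.shutdown();
    }
}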
Ah-ha, after some even more thorough study of ES discussions I found this: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html
Note how they specifically highlight the "custom value" case, recommending to use it exactly to solve my problem.
So, given that's their official recommendation, we can summarise it like this.
To fight volatile reads, we are supposed to use "preference" with a "custom" value or some other approach. To also get "read your writes" consistency, we can have all clients use "preference=_primary", because the primary shard is the first to get all writes. This, however, will probably have worse performance than the "custom" mode due to no distribution. And that's quite similar to what other people here said about Ignite and Hazelcast. Right?
Of course that's a solution specifically for ES. Reverting to my initial question, which is a bit more generic, it turns out that options #2 and #3 are really considered good enough for many distributed systems, with #3 being possible to achieve through #2 (even without direct support for #3 by the DB).
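As an illustration of the "custom preference" approach, here is a minimal sketch in the Java high-level REST client style (the index name, session id, and query are hypothetical): the same preference string always routes the search to the same shard copies, so consecutive requests from one session see monotonic results.

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class PreferenceSearch {
    static SearchResponse searchForSession(RestHighLevelClient client, String sessionId) throws Exception {
        SearchRequest request = new SearchRequest("records");
        // Reuse the session id as the preference value for every search in this session.
        request.preference(sessionId);
        request.source(new SearchSourceBuilder().query(QueryBuilders.matchAllQuery()));
        return client.search(request, RequestOptions.DEFAULT);
    }
}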

How can I deal with a sudden loss of a clustered node using the low level .net Elasticsearch client?

I have set up a cluster of three Elasticsearch nodes, all master-eligible, with 2 being the minimum required. I have configured a client to bulk upload using the low-level client with a static connection pool, using the code below.
What I am trying to test is live fail-over scenarios, i.e. start the client with three nodes available and then randomly drop one (shutting down the VM), but keep two up. However, I am not seeing the behavior I would expect: it keeps trying the dead node. It actually seems to take up to about sixty seconds before it moves on to the next node.
What I would expect it to do is take the failed attempt, mark that node as potentially dead, but at least move on to the next node. What is odd is that this is the behavior I get if I start my application with only two of the three nodes available in my list, or if I just stop the Elasticsearch service during a test rather than powering down.
Is there a correct way to deal with such a case and get it to move to the next available node as quickly as possible? Or do I need to potentially back off in my code for up to sixty seconds before attempting a republication?
var nodes = new[]
{
    new Node(new Uri("http://172.16.2.10:9200")),
    new Node(new Uri("http://172.16.2.11:9200")),
    new Node(new Uri("http://172.16.2.12:9200"))
};
var connectionPool = new StaticConnectionPool(nodes);
var settings = new ConnectionConfiguration(connectionPool)
    .PingTimeout(TimeSpan.FromSeconds(10))
    .RequestTimeout(TimeSpan.FromSeconds(20))
    .ThrowExceptions()
    .MaximumRetries(3);
_lowLevelClient = new ElasticLowLevelClient(settings);
The following I have then wrapped in a try/catch, where I retry a maximum of three times before I consider it a failed attempt and revert to an error strategy.
ElasticsearchResponse<Stream> indexResponse = _lowLevelClient.Bulk<Stream>(data);
Any input is appreciated,
Thank you.
The tests for the client include tests for failover scenarios from which the API conventions documentation is generated. Specifically, take a look at the retry and failover documentation
With a StaticConnectionPool, the nodes to which requests can be made are static and never refreshed to reflect nodes that may join and leave the cluster, but they will be marked as being dead if a bad response is returned, and will be taken out of rotation for executing requests on for a configurable dead time, controlled by DeadTimeout and MaxDeadTimeout on connection settings.
The audit trail on the response should provide a timeline of what has happened for a given request, which is easiest to see with response.DebugInformation. The Virtual Clustering test harness (an example) that is part of the Tests project may help to ascertain the correct settings for the behaviour you're after.

How do I figure out the new master node when my master node fails in ElasticSearch?

Let's say I have 3 nodes. 1 of which is the master.
I have an API (running on another machine) which hits the master and gets my search result. This is through a subdomain, say s1.mydomain.com:9200 (assume the others are pointed to by s2.mydomain.com and s3.mydomain.com).
Now my master fails for whatever reason. How would my API recover from such a situation? Can I hit either S2 or S3 instead? How can I figure out what the new master is? Is there a predictable way to know which one would be picked as the new master should the master go down?
I've googled this and it's given me enough information about how when a master goes down, a failover is picked as the new master but I haven't seen anything clarify how I would need to handle this from the outside looking in.
The master in ElasticSearch is really only for internal coordination. There are no actions required when a node goes down, other than trying to get it back up to get your full cluster performance back.
You can read/write to any of the remaining nodes and data replication will keep going. When the old master node comes back up, it will re-join the cluster once it has received the updated data. In fact, you never need to worry about whether the node you are writing to is the master node.
There are some advanced configurations to alter these behaviors, but ElasticSearch comes with suitable defaults.
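Since any node can serve reads and writes, the practical takeaway is to configure the client with all of the endpoints rather than pointing it only at whichever node happens to be master. A minimal sketch using the Java low-level REST client (the hostnames are the ones from the question; the client choice and port scheme are assumptions):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class AnyNodeClient {
    public static RestClient build() {
        // The client round-robins across these nodes and skips ones that stop responding,
        // so a master failover requires no change on the application side.
        return RestClient.builder(
                new HttpHost("s1.mydomain.com", 9200, "http"),
                new HttpHost("s2.mydomain.com", 9200, "http"),
                new HttpHost("s3.mydomain.com", 9200, "http")).build();
    }
}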

Replacing Nagios HTTP with custom (select/poll driven) daemon?

I have a Nagios configuration which is performing a number of tests on a few hundred nodes; one of these is a variant of check_http. It's not configured to --enable-embedded-perl (ePN), but we'll be changing that soon. Even with ePN enabled I'm concerned about the model where each execution of this Perl HTTP+SSL check handles only a single target.
I'd like to write a simple select() (or poll() / epoll()) driven daemon which creates connections to multiple targets concurrently, reads the results and spits out results in a form that's useable to Nagios as if it were results from a passive check.
Is there a guide to how one could accomplish this? What's the interface or API for providing batched check updates to Nagios?
One hack I'm considering would be to have my daemon update a Redis store (with a key for each target, and a short expiration time) and replace check_http with a very small, lightweight GET of the local Redis instance on the key (the GET would either return the actual results for Nagios, or a "(nil)" response which would be treated as if the HTTP connection had timed out).
However, I'm also a bit skeptical of my idea since I'd think someone has already built something like this by now.
(BTW: I'm ready to be convinced to switch to something like Icinga or Zabbix or Zenoss or OpenNMS ... pretty much anything that will scale better).
As to whether or not to let Nagios handle the scheduling and checks, I'll leave that to you, as it varies depending on your version of Nagios (newer versions can run these checks concurrently) and on why you want a separate daemon for it. Regarding versioning of Nagios, version 3 IIRC uses concurrent checks, and so scales to larger node counts than you report.
However, I can answer the Redis route concept as I've done it with Postfix queue stats and TTFB tracking for web sites.
Setting up the check using Python with the curl and multiprocessing modules is fairly straightforward, as is dumping the results into Redis. To keep the DB from growing, I'd recommend an expiration no more than (or possibly just less than) the check interval, which also avoids grabbing stale check results. If the currently running check hasn't completed and the Redis-to-Nagios check runs, pulling in the previous check, you can miss failed checks.
For the Redis-to-Nagios check, a simple redis-cli+bash script or Python check to pull the data for a given host, returning OK or otherwise depending on your data, is fairly simple and would run quickly enough.
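As an illustration of that Redis-to-Nagios check (the answer suggests redis-cli+bash or Python; this sketch uses the Jedis Java client instead, and the key layout, host, and messages are hypothetical; exit codes follow the standard Nagios plugin convention):

import redis.clients.jedis.Jedis;

public class CheckHttpFromRedis {
    public static void main(String[] args) {
        String target = args.length > 0 ? args[0] : "example-host";
        try (Jedis redis = new Jedis("127.0.0.1", 6379)) {
            // The polling daemon is assumed to write its latest result under this key
            // with a short TTL; a missing/expired key means the result is stale or the check failed.
            String result = redis.get("check_http:" + target);
            if (result == null) {
                System.out.println("CRITICAL - no recent result for " + target);
                System.exit(2); // Nagios CRITICAL
            }
            System.out.println("OK - " + result);
            System.exit(0); // Nagios OK
        }
    }
}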
I'd recommend running the Redis instance on the Nagios check server to ensure minimum latency and to avoid a network issue causing false alerts on your checks. I would also recommend a Nagios check on your Redis instance and on the checking daemon. Make the check_http replacement check dependent on the Redis and http_check daemons running. Thus you have a dependency chain as follows:
Redis -> http_checkd -> http_check_replacement
This will prevent false alerts on http_check_replacement by identifying the problem. For example, if your redis_checkd dies you get alerted to that, not 200+ "failed http_check_replacement" ones.
Also, since your data in Redis is by definition transient, I would disable the disk persistence. No need to write to disk when the data is constantly rotating.
On a side note, if using libcurl, I would recommend you pull statistics from libcurl about how long it takes to get the connection open and how long the server takes to respond (Time To First Byte - TTFB), and take advantage of Nagios's ability to store check statistics. You may well reach a time when having that data is really handy for troubleshooting and performance analysis.
I have a CLI tool I've written in C which does this and uploads the results into a local Redis instance. It is fast - barely more than the time to get the URL. I'm expecting it to be open sourced this week; I can add Nagios-style output to it fairly easily. In fact, I think I'll do that in the next week or two.
