I'm using the DataStax Cassandra Java driver 2.1.2 to have clients connect to one of three data centers, like so:
.withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1",1)))
This sets DC1 as the local data center, and also has the driver keep one connection to each of the two remote data centers.
Now if some of the nodes are down in the local data center, the client will fail to get a local quorum on an insert statement, and an UnavailableException will be thrown. But there are sufficient nodes available in the remote data centers for the insert to get a quorum there and succeed, so I would like the driver to retry the insert in the other data centers. But how do I tell the driver to do this?
It looks like there is a way to set a RetryPolicy to retry with a lower consistency level, but I don't see anything about retrying to a remote data center.
If all the nodes in DC1 are down, then the driver does try the insert at a remote data center where it succeeds.
The way I ended up getting this to work is that I first try the insert with these settings (note that I'm using the "IF NOT EXISTS" clause on the insert):
statement.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
statement.setSerialConsistencyLevel(ConsistencyLevel.LOCAL_SERIAL);
This tells Cassandra to only do the insert if it can get a local quorum for both the "IF NOT EXISTS" check and for the write. If there aren't enough replicas alive to get a local quorum, I catch the UnavailableException and NoHostAvailableException exceptions and change the consistency level to:
statement.setConsistencyLevel(ConsistencyLevel.QUORUM);
statement.setSerialConsistencyLevel(ConsistencyLevel.SERIAL);
Then I try the insert again and this time it will try to get a quorum across all the data centers and succeed. So with this approach I get decent performance for most inserts by restricting the very expensive "IF NOT EXISTS" check to the local DC, while getting the reliability of not being dead in the water when some of the local replicas are down.
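A minimal sketch of that fallback, assuming statement is the "INSERT ... IF NOT EXISTS" statement and session is an open driver Session (the method name is just for illustration):

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.NoHostAvailableException;
import com.datastax.driver.core.exceptions.UnavailableException;

// Try the insert against the local DC first; if not enough local replicas are
// alive, retry the same statement with a cross-DC quorum.
public static ResultSet insertWithDcFallback(Session session, Statement statement) {
    try {
        statement.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        statement.setSerialConsistencyLevel(ConsistencyLevel.LOCAL_SERIAL);
        return session.execute(statement);            // fast path: local quorum only
    } catch (UnavailableException | NoHostAvailableException e) {
        statement.setConsistencyLevel(ConsistencyLevel.QUORUM);
        statement.setSerialConsistencyLevel(ConsistencyLevel.SERIAL);
        return session.execute(statement);            // fallback: quorum across all DCs
    }
}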
There is a common case where we update ClickHouse's config, which requires restarting ClickHouse to take effect. During the restart, query services that depend on ClickHouse's distributed tables throw exceptions because they lose their connection to the restarting server.
So, as the title says, what I want is a way to keep distributed tables working for queries when one of the shard servers is down. Thanks.
I see two ways:
Since this server failure is transient, you can refactor your server-side code by adding a retry policy to your requests (for C# I would recommend Polly); see the sketch after this list.
Use a proxy (load balancer) in front of CH (for example chproxy).
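Polly is a .NET library; as a rough Java analogue (the helper class and its method are hypothetical, and the attempt count and delay are just examples), a small retry wrapper around your query call could look like this:

import java.util.concurrent.Callable;

// Hypothetical retry helper: re-runs a query a few times with a fixed delay,
// which is usually enough to ride out a short ClickHouse restart.
public final class Retry {
    public static <T> T withRetry(Callable<T> query, int attempts, long delayMillis) throws Exception {
        for (int i = 1; i <= attempts; i++) {
            try {
                return query.call();
            } catch (Exception e) {
                if (i == attempts) {
                    throw e;                     // give up after the last attempt
                }
                Thread.sleep(delayMillis);       // node may still be restarting; wait and retry
            }
        }
        throw new IllegalArgumentException("attempts must be >= 1");
    }
}

You would then wrap the existing call, e.g. Retry.withRetry(() -> runDistributedQuery(sql), 5, 2000), where runDistributedQuery stands in for whatever your service already uses.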
UPDATE
When one node in a cluster is restarting, a distributed table created over replicated tables should remain accessible (of course, requests shouldn't be sent to the restarting node).
Data availability is achieved through replication, so you need to create Replicated*-tables over your materialized view and then create Distributed-tables over the Replicated*-tables.
Please look at the articles CH Data Distribution and Distributed vs Shard vs Replicated,
and, as a working example (it is not your case), at CH Circular cluster topology.
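For illustration only, here is a sketch of that layering executed over JDBC (the cluster name, table names and the jdbc:clickhouse URL are assumptions, and it presumes the ClickHouse JDBC driver is on the classpath; in your case the local table would be the one fed by your materialized view):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: a Replicated* local table on every shard plus a Distributed table over it,
// so queries against the Distributed table keep working while a single replica restarts.
public class CreateClickHouseTables {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://ch-host:8123/default");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE events_local ON CLUSTER my_cluster (d Date, id UInt64, v String) "
                + "ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}') "
                + "PARTITION BY toYYYYMM(d) ORDER BY id");
            stmt.execute(
                "CREATE TABLE events_dist ON CLUSTER my_cluster AS events_local "
                + "ENGINE = Distributed(my_cluster, default, events_local, rand())");
        }
    }
}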
I'm currently testing failure scenarios using 3 cockroachDB nodes.
Using this scenario:
Inserting records in a loop
Shutdown 2 nodes out of 3 (to simulate Quorum lost)
Wait long enough so the Postgres JDBC driver throws an IOException
Restart one node to bring back Quorum
Retry previous failed statement
I then hit the following exception
Cause: org.postgresql.util.PSQLException: ERROR: duplicate key value (messageid)=('71100358-aeae-41ac-a397-b79788097f74') violates unique constraint "primary"
This means the insert from the first attempt (the one that gave me the IOException) actually succeeded once the quorum became available again. The problem is that I'm not aware of it.
I cannot assume that a "duplicate key value" exception is caused by an application logic issue. Are there any parameters I can tune so the underlying statement rolls back before the IOException? Or is there a better approach?
Tests were conducted using
CockroachDB v1.1.5 ( 3 nodes )
MyBatis 3.4.0
PostgreSQL driver 42.2.1
Java 8
There are a couple of things that could be happening here.
First, if one of the nodes you're killing is the gateway node (the one your Java process is connecting to), it could just be that the data is being committed, but the node is dying before it's able to send the confirmation back to the client. In this case, there's not much that can be done by CockroachDB or any other database.
The more subtle case is where the nodes you're killing are nodes besides the gateway node. That is, where the node you were talking to sent you back an error despite the data being committed successfully. The problem here is that the data is committed as soon as it's written to raft, but it's possible, if the other nodes have died (and could come back up later), that there's no way for the gateway node to know whether they have committed the data that it asked them to. In situations like this, CockroachDB returns an "ambiguous result error". I'm not sure how JDBC exposes the specifics of the errors returned to the client in cases like this, but if you inspect the error itself it should say something to that effect.
Ambiguous results in CockroachDB are briefly discussed in its Jepsen analysis, and see this page in the CockroachDB docs for information on the kinds of errors that can be returned.
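One hedged way to handle this with plain JDBC is sketched below. The table and column names are taken from the error above, the SQLState "40003" check corresponds to the statement_completion_unknown code the CockroachDB docs describe for ambiguous results, and the IOException check is a rough stand-in for a dropped connection, so verify both against the docs linked above; after a real connection loss you would also need to re-establish the connection before the lookup.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch: on an ambiguous result or a dropped connection, check whether the row
// already landed before retrying, instead of treating "duplicate key" as an app bug.
public static void insertMessage(Connection conn, String messageId, String body) throws SQLException {
    try (PreparedStatement ps =
             conn.prepareStatement("INSERT INTO messages (messageid, body) VALUES (?, ?)")) {
        ps.setString(1, messageId);
        ps.setString(2, body);
        ps.executeUpdate();
    } catch (SQLException e) {
        boolean outcomeUnknown = "40003".equals(e.getSQLState())      // ambiguous result
                || e.getCause() instanceof java.io.IOException;       // connection died mid-statement
        if (!outcomeUnknown) {
            throw e;
        }
        // Outcome unknown: look the row up instead of blindly retrying the insert.
        try (PreparedStatement check =
                 conn.prepareStatement("SELECT 1 FROM messages WHERE messageid = ?")) {
            check.setString(1, messageId);
            try (ResultSet rs = check.executeQuery()) {
                if (!rs.next()) {
                    insertMessage(conn, messageId, body);             // really didn't commit, retry
                }
            }
        }
    }
}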
I'm trying to connect to Oracle DB from Spark SQL with following code:
val dataTarget=sqlcontext.read.
format("jdbc").
option("driver", config.getString("oracledriver")).
option("url", config.getString("jdbcUrl")).
option("user", config.getString("usernameDH")).
option("password", config.getString("passwordDH")).
option("dbtable", targetQuery).
option("partitionColumn", "ID").
option("lowerBound", "5").
option("upperBound", "499999").
option("numPartitions", "10").
load().persist(StorageLevel.DISK_ONLY)
By default, when we connect to Oracle through Spark SQL, it creates one connection and a single partition for the entire RDD. This way I lose parallelism and run into performance issues when there is a huge amount of data in a table. In my code I have passed option("numPartitions", "10"),
which will create 10 connections. Please correct me if I'm wrong; as far as I know, the number of connections to Oracle will be equal to the number of partitions we pass.
I'm getting the below error if I use more connections, possibly because there is a connection limit in Oracle.
java.sql.SQLException: ORA-02391: exceeded simultaneous
SESSIONS_PER_USER limit
If I use more partitions for more parallelism, the error comes up, but if I use fewer I face performance issues. Is there any other way to create a single connection and load the data into multiple partitions (this would save my life)?
Please suggest.
Is there any other way to create a single connection and load data into multiple partitions
There is not. In general, partitions are processed by different physical nodes and different virtual machines. Considering all the authorization and authentication mechanisms, you cannot just take a connection and pass it from node to node.
If the problem is just exceeding SESSIONS_PER_USER, contact the DBA and ask them to increase the value for the Spark user.
If the problem is throttling, you can try to keep the same number of partitions but decrease the number of Spark cores. Since this is mostly micromanaging, it might be better to drop JDBC completely, use a standard export mechanism (COPY FROM) and read the files directly.
One workaround might be to load the data using a single Oracle connection (partition) and then simply repartition:
val dataTargetPartitioned = dataTarget.repartition(100);
You can also partition by a field (if partitioning a dataframe):
val dataTargetPartitioned = dataTarget.repartition(100, col("MY_COL")) // requires import org.apache.spark.sql.functions.col
I have a heavy and large mongo table, which has a lot of reads. One of the read clients is an offline process which periodically scans a table aggressively. While other clients read the same table as online service. I'd like to separate them. What I'm thinking is to have a dedicate replica node for this offline client to read from, and then let the other clients read from the remaining replicas. How to do that?
You should consider marking one of the nodes as a hidden member of the replica set. It will receive all the replicated writes from the primary but won't receive any read traffic from your online service (as long as that service uses a proper replica-set-enabled connection string). Then, from your offline client, you can use a connection string that targets the hidden member directly.
http://docs.mongodb.org/manual/core/replica-set-hidden-member/
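For example, with a recent MongoDB Java sync driver (host, database and collection names are placeholders, and the directConnection and readPreference options assume a driver/server version that supports them; the member itself is made hidden with priority 0 via rs.reconfig() in the mongo shell):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Sketch: one client for the online service (replica-set aware, never sees the
// hidden member) and one for the offline scanner (pinned to the hidden member).
public class SplitReadClients {
    public static void main(String[] args) {
        MongoClient online = MongoClients.create(
                "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0");

        MongoClient offline = MongoClients.create(
                "mongodb://hidden-node:27017/?directConnection=true&readPreference=secondaryPreferred");

        MongoCollection<Document> coll =
                offline.getDatabase("mydb").getCollection("mycollection");
        coll.find().forEach(doc -> {
            // periodic aggressive scan runs here, isolated from online traffic
        });

        online.close();
        offline.close();
    }
}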
In HBase, how the put/get operations know which region server the row should be written to?
In case of multiple rows to be read how multiple region servers are contacted and the results are retrieved?
I assume your question is simply curiosity, since this behavior is abstracted from the user and you shouldn't care.
In HBase, how the put/get operations know which region server the row should be written to?
From the hbase documentation book:
The HBase client HTable is responsible for finding RegionServers that are serving the particular row range of interest. It does this by querying the .META. and -ROOT- catalog tables (TODO: Explain). After locating the required region(s), the client directly contacts the RegionServer serving that region (i.e., it does not go through the master) and issues the read or write request. This information is cached in the client so that subsequent requests need not go through the lookup process. Should a region be reassigned either by the master load balancer or because a RegionServer has died, the client will requery the catalog tables to determine the new location of the user region.
So the first step is looking up in meta and root to determine where the row lives; then the client contacts that region server to do the work.
In case of multiple rows to be read how multiple region servers are contacted and the results are retrieved?
There are two ways to read from HBase in general: scanners and gets.
If you run multiple gets, those will each individually fetch those records separately. Each one of those is possibly going to a different region server.
The scanner will simply look for the start of the range and then move forward from there. Sometimes it needs to move to a different region server when it reaches the end of a region, but the client handles that behind the scenes. If there is some way to design the table such that your multiple gets become a single scan instead of a series of gets, you should hypothetically get better performance.
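For completeness, this is roughly what the two read paths look like with the Java client (the table name and row keys are placeholders; withStartRow/withStopRow are the HBase 2.x method names, older clients use setStartRow/setStopRow):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of both read paths; the client library does the region lookup and caching.
public class HBaseReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mytable"))) {

            // Get: the client resolves which region (and region server) holds this
            // exact row key via the catalog table, caches it, then asks that server.
            Result row = table.get(new Get(Bytes.toBytes("row-123")));

            // Scan: starts at the region containing the start row and hops to the
            // next region server behind the scenes as it crosses region boundaries.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("row-100"))
                    .withStopRow(Bytes.toBytes("row-200"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    // process each row here
                }
            }
        }
    }
}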
Providing the same scenario and explanation from the BigTable paper: "The client library caches tablet locations. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy. If the client's cache is empty, the location algorithm requires three network round-trips, including one read from Chubby. If the client's cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses (assuming that METADATA tablets do not move very frequently). Although tablet locations are stored in memory, so no GFS accesses are required, we further reduce this cost in the common case by having the client library prefetch tablet locations: it reads the metadata for more than one tablet whenever it reads the METADATA table."
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf