RethinkDB Changefeeds Lost Connection - rethinkdb

I registered a changefeed on one of my RethinkDB tables, and it has been working perfectly for the past 6 days. Data gets posted to the table every 2 minutes, and the changefeed has no problem picking it up. But yesterday, for some reason, the data uploading stopped for about an hour. During that period the changefeed didn't receive any changes, which is expected. The problem is that after the uploading resumed, I still couldn't get any changes from RethinkDB; the feed seemed lost for good. The problem went away once I restarted my program. The code looks like this:
import logging

import rethinkdb as rdb
from rethinkdb.errors import RqlDriverError

logger = logging.getLogger(__name__)

def get_rdb_connection():
    """Helper function to get an rdb connection."""
    try:
        rdb_conn = rdb.connect(host='rethinkdb_ip', port=28015, db='test')
        logger.info('connection established.')
        return rdb_conn
    except RqlDriverError:
        logger.error("No rdb database connection could be established.")
        return

conn = get_rdb_connection()
feed = rdb.table('skydata').changes(squash=False).run(conn)
for change in feed:
    logger.info("change detected")
    """ do some stuff """
The output:
change detected
change detected
change detected
...
change detected
Then the output just stopped.
I'm pretty sure the code in """ do some stuff """ doesn't block, since it is pretty simple and has been running for weeks. Supervisord status also shows the process is always in a running state.
So I wonder whether there is a timeout-like mechanism: if there are no changes for a certain period, does the changefeed stop listening?
Edit:
I think it may be due to a lost connection, since there were no changes for more than an hour and the program was just sitting idle. I'm not sure how the RethinkDB driver manages its connection to the database. Is the connection kept alive automatically? Will it be re-established if it gets lost?

The connection shouldn't be lost in that scenario, and if it is, you should receive an error in your driver rather than the changefeed silently dropping changes.
On the face of it this sounds like a bug; I opened https://github.com/rethinkdb/rethinkdb/issues/4572 to track it. If it isn't too much trouble, could you add some more details on your setup to that issue? (What OS you're using, what your network setup is, whether this bug is regularly reproducible for you, etc.)
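In the meantime, a defensive workaround is to treat any driver error on the feed as a signal to reconnect and re-open the changefeed. This is only a sketch built on the question's own code (the get_rdb_connection helper, the skydata table, and the RqlDriverError import above), not an official recommendation:

import time
from rethinkdb.errors import RqlDriverError, RqlRuntimeError

while True:
    conn = get_rdb_connection()
    if conn is None:
        time.sleep(10)  # back off before retrying the connection
        continue
    try:
        feed = rdb.table('skydata').changes(squash=False).run(conn)
        for change in feed:
            logger.info("change detected")
            """ do some stuff """
    except (RqlDriverError, RqlRuntimeError):
        # The connection (or the feed) died; log it and go reconnect.
        logger.exception("changefeed interrupted, reconnecting")
        time.sleep(10)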

Related

Redis intermittent "crash" on Laravel Horizon. Redis stops working every few weeks/months

I have an issue with Redis that affects the running of my Laravel Horizon queues, and I am unsure how to debug it at this stage, so I am looking for some advice.
Issue
Approx. every 3 - 6 weeks my queues stop running. Every time this happens, the first set of exceptions I see is:
Redis Exception: socket error on read socket
Redis Exception: read error on connection to 127.0.0.1:6379
Both of these are caused by Horizon running the command:
artisan horizon:work redis
Theory
We push around 50k - 100k jobs through the queue each day and I am guessing that Redis is running out of resources over the 3-6 week period. Maybe general memory, maybe something else?
I am unsure if this is due to a leak within my system or something else.
Current Fix
At the moment, I simply run the command redis-cli FLUSHALL to completely clear the database and we are back working again for another 3 - 6 weeks. This is obviously not a great fix!
Other Details
Currently Redis runs within the webserver (not a dedicated Redis server). I am open to changing that, but it would not fix the root cause of the issue.
Help!
At this stage, I am really unsure where to start in terms of debugging and identifying the issue. I feel that is probably a good first step!
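One way to start testing the resource theory (just a sketch, assuming a redis-py client pointed at the same 127.0.0.1:6379 instance; which metric matters for this setup is still an open question) is to snapshot Redis memory and client stats periodically and compare them over the weeks:

import redis

r = redis.Redis(host='127.0.0.1', port=6379)

mem = r.info('memory')          # same data as `redis-cli INFO memory`
clients = r.info('clients')

print('used_memory_human :', mem.get('used_memory_human'))
print('maxmemory         :', mem.get('maxmemory'))           # 0 means no limit
print('maxmemory_policy  :', mem.get('maxmemory_policy'))    # e.g. noeviction
print('connected_clients :', clients.get('connected_clients'))
print('blocked_clients   :', clients.get('blocked_clients'))

If used_memory keeps climbing toward maxmemory under a noeviction policy, that would fit the "running out of resources" theory; Horizon's own trim settings (how long it keeps recent/completed job data) would then be worth checking.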

How resilient is reporting to Trains server?

How would Trains go about sending any missing data to the server in the following scenarios?
Internet connection breaks temporarily while running an experiment
Internet connection breaks and doesn't come back before the experiment ends (any manual way to send all the data that was missed?)
The machine running Trains server resets in the middle of an experiment
Disclaimer: I'm part of the allegro.ai Trains team
Trains will automatically retry sending logs, basically forever. The logs/metrics are sent in a background thread, so it should not interfere with execution. You can control the retry frequency by adjusting the sdk.network.iteration.retry_backoff_factor_sec parameter in your ~/trains.conf file.
The experiment will try to flush all metrics to the backend when it ends, i.e. the process will wait at exit until all metrics are sent. This means that if the connection was dropped, it will keep retrying until the connection is back up. If the experiment was aborted manually, there is no way to capture/resend those lost metric reports. That said, the new 0.16 version introduced offline mode: you can run the entire experiment offline, then later report all the logs/metrics/artifacts.
The Trains-Server machine is fully stateless (the state itself is stored in the databases on the machine), which means that from the experiment's perspective the connection was simply dropped for a few minutes and then became available again. To your question: if the Trains-Server restarts, it is transparent to all experiments; they continue as usual and no reports are lost.
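For reference, a minimal sketch of what that backoff setting could look like in ~/trains.conf (HOCON-style config; the value is only an illustrative example, not the documented default):

# ~/trains.conf
sdk {
    network {
        iteration {
            # base factor (in seconds) for the exponential backoff between retries
            retry_backoff_factor_sec: 0.1
        }
    }
}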

How to limit Couchbase client from trying to connect to Couchbase server when it's down?

I'm trying to handle Couchbase bootstrap failure gracefully and not fail the application startup. The idea is to use "Couchbase as a service", so that if I can't connect to it, I should still be able to return a degraded response. I've been able to somewhat achieve this by using the Couchbase async API; RxJava FTW.
The problem is that when the server is down, the Couchbase Java client goes crazy and keeps trying to connect to the server; from what I can see, the class that does this is ConfigEndpoint, and there's no limit to how many times it tries before giving up. This floods the logs with java.net.ConnectException: Connection refused errors. What I'd like is for it to try a few times and then stop.
Got any ideas that can help?
Edit:
Here's a sample app.
Steps to reproduce the problem:
svn export https://github.com/asarkar/spring/trunk/beer-demo.
From the beer-demo directory, run ./gradlew bootRun. Wait for the application to start up.
From another console, run curl -H "Accept: application/json" "http://localhost:8080/beers". The client request is going to time out due to the failure to connect to Couchbase, but the Couchbase client is going to flood the console continuously.
The reason we chose to have the client keep trying to connect is that Couchbase is typically deployed in high-availability clustered situations. Most people who run our SDK want it to keep trying to work. We do it fairly intelligently, I think, in that we use exponential backoff and have tuneables, so it's reasonable out of the box and can be adjusted to your environment.
As to what you're trying to do, one of the tuneables is related to retry. By adjusting the timeout value and the retry strategy, you can keep the client referenceable by the application and simply fail fast if it can't service the request.
The other option is that we do have a way to let your application know which node would handle the request (or null if the bootstrap hasn't been done), and you can use this to implement circuit-breaker-like functionality. For a future release, we're looking to add circuit breakers directly to the SDK.
All of that said, these are not the normal path as the intent is that your Couchbase Cluster is up, running and accessible most of the time. Failures trigger failovers through auto-failover, which brings things back to availability. By design, Couchbase trades off some availability for consistency of data being accessed, with replica reads from exception handlers and other intentionally stale reads for you to buy into if you need them.
Hope that helps and glad to get any feedback on what you think we should do differently.
Solved this issue myself. The client I designed handles the following use cases:
The client startup must be resilient to CB failure/unavailability.
The client must not fail the request, but return a degraded response instead, if CB is not available.
The client must reconnect should a CB failover happen.
I've created a blog post here. I understand it's preferable to copy-paste rather than linking to an external URL, but the content is too big for an SO answer.
Start a separate thread and keep calling ping on it every 10 or 20 seconds; once CB is down, ping will start failing. Add a check like "if ping fails 5-6 times in a row, then close all the CB connections/resources".
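To illustrate that ping-watchdog pattern generically (a Python sketch of the idea only; client.ping() and client.close() are placeholders for whatever health-check and shutdown calls your Couchbase client actually exposes):

import threading
import time

def watch_connection(client, interval=15, max_failures=5):
    """Ping the backing store periodically; release it after repeated failures."""
    failures = 0
    while True:
        time.sleep(interval)
        try:
            client.ping()          # placeholder for the real health check
            failures = 0           # healthy again, reset the counter
        except Exception:
            failures += 1
            if failures >= max_failures:
                client.close()     # placeholder: close connections/resources
                break

# Run the watchdog in a background thread so it doesn't block the application:
# threading.Thread(target=watch_connection, args=(client,), daemon=True).start()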

Why are my ADODB queries not being persisted to the SQL server?

My VB6 program uses ADODB to do a lot of SQL (2000) CRUD.
Sometimes the network connection between the remote clients and the data center "drops", making it impossible to establish new connections (so users launching the program can't use it).
The issue is the following:
Anyone who is using the program at the moment of the "drop" can continue using it with no issues whatsoever: they can perform every operation, update data, read data, and everything seems to be working normally.
The user then proceeds to fire up a "sum-up" report which lists everything that was done (before or after the "drop").
If we then check the database, none of the data for whatever was done after the network drop is there. The user goes back into the program and everything is as it was before the network drop.
It seems like all queries were somehow performed in memory? I'm at a loss about how to even approach the issue (I'm familiar enough with VB6 to work with the source code, but I don't know a lot about ADODB).
I haven't yet tried to replicate the behavior due to the customer's limited availability (the development environment is housed in their offices); I'll try starting up the program from the IDE and then ripping out the network cable.
Provided I can replicate the issue, how do I fix this? Is there some setting I'm not aware of?
On a side note: the issue is sporadic (it happened a handful of times during the last year, and the software is used heavily and on a daily basis by multiple concurrent users).
After reading up on disconnected recordsets (client-side cursors that keep the fetched data cached in memory and only write changes back to the server in a batch), it seems that's what's behind this odd behavior I'm experiencing.
This is not something that can simply be "turned off".

CURSOR_NOT_FOUND - my cron jobs started dying in the middle

A cron job that was successfully running for years suddenly started dying at about 80% completion. I'm not sure whether it is because the results collection was steadily growing and reached some critical size (it doesn't seem all that big to me) or for some other reason. I'm not sure how to debug this: I found the user at which the job died and tried to run the job for just that user, and got a CURSOR_NOT_FOUND message after 2 hours. Yesterday it died after 3 hours of running for all users. I'm still using an old Mongoid (2.0.0.beta) because of multiple dependencies and a lack of time to change it, but mongo is up to date (I know about the bug in versions before 1.1.2).
I found two similar questions, but neither of them is applicable. In one case they used Moped, which was not production-ready; in the other the problem was in pagination.
I am getting this error message
MONGODB cursor.refresh() for cursor xxxxxxxxx
rake aborted!
Query response returned CURSOR_NOT_FOUND. Either an invalid cursor was specified, or the cursor may have timed out on the server.
Any suggestions?
A "cursor not found" error from MongoDB is typically an indication that the cursor timed out (after 10 minutes of inactivity) but it could potentially indicate that the client code has become confused and is using a stale or closed cursor or has corrupted the cursor somehow. If the 3 hour runtime included a lot of busy time on the client in between calls to MongoDB, that might give the server time to timeout the cursor.
You can specify a no-timeout option on the cursor to see if it is a server timeout of your cursor that is causing your problem.
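The question is on an old Mongoid, but for illustration, here is what that no-timeout option looks like with PyMongo (a sketch only; the database/collection names and the process() helper are made up, and a no-timeout cursor must be closed explicitly or it stays open on the server until the client disconnects):

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
coll = client['mydb']['results']               # hypothetical database/collection

cursor = coll.find({}, no_cursor_timeout=True)
try:
    for doc in cursor:
        process(doc)                           # placeholder for the per-document work
finally:
    cursor.close()                             # important: the server won't reap it for you

The old 1.x Ruby driver exposed a similar option (:timeout => false on find), but check the exact spelling for your Mongoid version.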
