A cron job that had been running successfully for years suddenly started dying at about 80% completion. I am not sure whether it is because the collection holding the results has been steadily growing and reached some critical size (it does not seem all that big to me) or for some other reason. I am not sure how to debug this. I found the user at which the job died and tried to run the job for that user alone; I got a CURSOR_NOT_FOUND message after 2 hours. Yesterday it died after 3 hours of running for all users. I am still using an old Mongoid (2.0.0.beta) because of multiple dependencies and a lack of time to upgrade, but the mongo gem is up to date (I know about the bug in versions before 1.1.2).
I found two similar questions, but neither of them is applicable. In one case they were using Moped, which was not production-ready at the time, and in the other the problem was in pagination.
I am getting this error message:
MONGODB cursor.refresh() for cursor xxxxxxxxx
rake aborted!
Query response returned CURSOR_NOT_FOUND. Either an invalid cursor was specified, or the cursor may have timed out on the server.
Any suggestions?
A "cursor not found" error from MongoDB is typically an indication that the cursor timed out (after 10 minutes of inactivity) but it could potentially indicate that the client code has become confused and is using a stale or closed cursor or has corrupted the cursor somehow. If the 3 hour runtime included a lot of busy time on the client in between calls to MongoDB, that might give the server time to timeout the cursor.
You can specify a no-timeout option on the cursor to see if it is a server timeout of your cursor that is causing your problem.
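For illustration, here is what that option looks like in pymongo; the question is on Mongoid 2.x, where the option name differs, so treat this as a sketch of the server-side flag rather than a drop-in fix (database and collection names are made up):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["results"]  # hypothetical database/collection names

# Ask the server not to reap this cursor after its usual ~10 minutes of
# inactivity. No-timeout cursors must be closed explicitly, hence the finally.
cursor = coll.find({}, no_cursor_timeout=True)
try:
    for doc in cursor:
        pass  # per-document work goes here
finally:
    cursor.close()

If the job stops failing with the flag set, the server-side timeout was the culprit; if it still dies, look at the client-side cursor handling instead.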
I have an issue with Redis that affects the running of the Laravel Horizon queue, and I am unsure how to debug it at this stage, so I am looking for some advice.
Issue
Approximately every 3-6 weeks my queues stop running. Every time this happens, the first set of exceptions I see is:
Redis Exception: socket error on read socket
Redis Exception: read error on connection to 127.0.0.1:6379
Both of these are caused by Horizon running the command:
artisan horizon:work redis
Theory
We push around 50k - 100k jobs through the queue each day and I am guessing that Redis is running out of resources over the 3-6 week period. Maybe general memory, maybe something else?
I am unsure if this is due to a leak within my system or something else.
Current Fix
At the moment, I simply run the command redis-cli FLUSHALL to completely clear the database and we are back working again for another 3 - 6 weeks. This is obviously not a great fix!
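Before the next FLUSHALL, it might help to record what Redis is actually holding so the growth can be compared between incidents. A minimal diagnostic sketch with redis-py, assuming the default local instance and the default Horizon/queue key prefixes:

import redis

# Assumption: the same instance Horizon talks to (127.0.0.1:6379, db 0).
r = redis.Redis(host="127.0.0.1", port=6379)

mem = r.info("memory")
print("used_memory_human:", mem["used_memory_human"])
print("maxmemory:", mem.get("maxmemory", 0))
print("total keys:", r.dbsize())

# Count keys by prefix; "horizon:" is Horizon's default prefix and
# "queues:" is where Laravel queue payloads normally live (both assumptions).
for prefix in ("horizon:", "queues:"):
    count = sum(1 for _ in r.scan_iter(match=prefix + "*", count=1000))
    print(prefix, count)

If used_memory climbs steadily between incidents while job throughput stays flat, that points at keys that are never trimmed rather than at raw queue volume.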
Other Details
Currently Redis runs on the web server (not a dedicated Redis server). I am open to changing that, but it would not fix the root cause of the issue.
Help!
At this stage, I am really unsure where to start in terms of debugging and identifying the issue, and I feel that working that out is probably a good first step!
I have been using oci8 for over a year now for several batch processes. There I made Oracle calls at a fixed frequency without any high number of parallel requests. Recently I started using this driver to process many user requests in parallel using goroutines. The connections go through 90% of the time, but for the remaining 10% I see a driver: bad connection error thrown by this driver. This generally happens in two situations:
When a connection has been left idle for too long (happens for a few requests).
When there is a spike in the number of connections.
Actions taken:
Already checked my Oracle DB for connection/session limits; there is no such limit on it.
Tried forking the repository and adding error logs, but it didn't seem to compile.
Most of the people who have faced this issue mention incorrect handling of multiple simultaneous connections. In my case that is handled by oci8.
Please help!
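The driver in question is Go's oci8, so the snippet below is not a drop-in fix; it is only a language-agnostic sketch, written in Python for illustration, of the usual mitigation for idle- or spike-related "bad connection" errors: treat a stale connection as retryable, discard it, and try once more on a fresh one. Every name in it is a stand-in.

import random


class StaleConnectionError(Exception):
    """Stand-in for the driver's 'bad connection' error."""


def open_connection():
    """Stand-in for a real connect; pretend some connections come back stale."""
    return {"stale": random.random() < 0.1}


def run_query(conn):
    """Stand-in for the real query; fails if the borrowed connection is stale."""
    if conn["stale"]:
        raise StaleConnectionError("driver: bad connection")
    return "rows"


def query_with_retry(retries=1):
    """On a stale-connection error, discard the connection and retry once."""
    for attempt in range(retries + 1):
        conn = open_connection()
        try:
            return run_query(conn)
        except StaleConnectionError:
            if attempt == retries:
                raise


print(query_with_retry())

In Go the other usual lever is the database/sql pool configuration (limiting idle connections and capping connection lifetime), so connections never sit idle long enough to be dropped on the server side.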
I registered a changefeed on one of my RethinkDB tables, and it had been working perfectly for the past 6 days. Data gets posted to my RethinkDB table every 2 minutes, and the changefeed has no problem picking it up. But yesterday, for some reason, the data uploading stopped for about an hour. During that period the changefeed didn't get any changes, which is expected. The problem is that after the data uploading restarted, I still could not get any changes from RethinkDB; they were lost forever. The problem went away when I restarted my program. The code is like:
import logging

import rethinkdb as rdb
from rethinkdb.errors import RqlDriverError

logger = logging.getLogger(__name__)

def get_rdb_connection():
    """Helper function to get an rdb connection."""
    try:
        rdb_conn = rdb.connect(host='rethinkdb_ip', port=28015, db='test')
        logger.info('connection established.')
        return rdb_conn
    except RqlDriverError:
        logger.error("No rdb database connection could be established.")
        return None

conn = get_rdb_connection()
feed = rdb.table('skydata').changes(squash=False).run(conn)
for change in feed:
    logger.info("change detected")
    """ do some stuff """
The output:
change detected
change detected
change detected
...
change detected
Then the output just stopped.
I'm pretty sure the code in """ do some stuff """ doesn't cause any blocking, since it is pretty simple and has been running for weeks. Supervisord status also shows the process is always in the running state.
So I wonder if there is a timeout-like mechanism: if there are no changes for a certain period, will it stop listening?
Edit:
I think it may be due to a connection loss, since there were no changes for more than an hour and the program was just doing nothing. I'm not sure how RethinkDB manages the connection to its database. Is the connection kept alive automatically? Will it be reestablished if it gets lost?
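One defensive pattern while this is investigated is to resubscribe the changefeed whenever the driver reports a connection problem. A minimal sketch, assuming the old Python driver surfaces a dropped connection as RqlDriverError or RqlRuntimeError (host, port, and table reused from the snippet above):

import logging
import time

import rethinkdb as rdb
from rethinkdb.errors import RqlDriverError, RqlRuntimeError

logger = logging.getLogger(__name__)

while True:
    try:
        conn = rdb.connect(host='rethinkdb_ip', port=28015, db='test')
        feed = rdb.table('skydata').changes(squash=False).run(conn)
        for change in feed:
            logger.info("change detected")
            # ... do some stuff ...
    except (RqlDriverError, RqlRuntimeError) as exc:
        # The driver does not resubscribe a changefeed on its own, so log,
        # back off briefly, then open a fresh connection and a fresh feed.
        logger.error("changefeed interrupted: %s; retrying in 5 seconds", exc)
        time.sleep(5)

Note that any change emitted while the feed is down is still missed; this only stops the program from sitting on a dead feed forever.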
The connection shouldn't be lost in that scenario, or if it is you should receive an error in your driver rather than the changefeed silently dropping changes.
On the face of it this sounds like a bug; I opened https://github.com/rethinkdb/rethinkdb/issues/4572 to track it. If it isn't too much trouble, could you add some more details on your setup to that issue? (What OS you're using, what your network setup is, whether this bug is regularly reproducible for you, etc.)
I have a .NET service which polls an Oracle database for records every 2 minutes.
But the service stops communicating with Oracle after a few hours of running and throws an exception.
I checked at the DB level and found there were 155 INACTIVE sessions. I restarted my service, and when I checked again there were around 70 INACTIVE sessions for my service.
This is causing an exception in my service and hence interrupting the work. Can anyone please help me understand where the problem is?
Why does it not close the session or reuse an existing one?
It turned out that connections to the database were not being closed in my code. I went through the whole thing again and closed the connections in a finally block, and it is working smoothly now. Thanks.
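For anyone hitting the same symptom, the shape of that fix is the same in any language. A small Python sketch with sqlite3 standing in for the Oracle connection (the original service is .NET, so only the pattern carries over):

import sqlite3


def poll_once(db_path):
    """Open a connection, run the poll query, and always close the connection."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        cur.execute("SELECT 1")  # stand-in for the real polling query
        return cur.fetchall()
    finally:
        # Without this, every 2-minute poll leaves another session behind on
        # the server, which is exactly the INACTIVE-session buildup described above.
        conn.close()


print(poll_once(":memory:"))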
A number of stored procedures I support query remote databases over a WAN. The network occasionally goes down, but the worst that ever happened was the procedures failed and would have to be restarted.
The last couple of weeks it has taken a sinister turn. Instead of failing, the procedures hang in a weird locked state. They can't be killed inside of Oracle, and as long as they exist any attempt to run other copies of the procedure will hang too. The only solution we've found is to kill the offending procedures with a "kill -9" from the OS. Some of these procedures haven't been changed for months, even years, so I suspect a root cause in the DB or DB configuration.
Does anyone have any ideas about what we can do to fix the problem? Or does PL/SQL have a time-out mechanism I can add to the code, so that I can create an exception that I can handle programmatically?
What database version? Are they stuck running SQL or in PL/SQL?
Has anyone added exception handling into the routines recently ?
I remember in 9iR2, we were told that, instead of raising an exception to the calling routine, we were to catch all exceptions and keep running (basically try to process all the items in the job even if some fail).
We inevitably had jobs get stuck in an infinite loop, with SQL statements failing, getting caught by the exception handler, and being tried again. And they couldn't be killed, as the WHEN OTHERS handler also caught the 'your session has been killed' exception. I think the latter changed in 10g so that exception no longer gets caught.
We were never able to determine what caused this to happen. We believe it was a defect in the October 2008 cumulative patch. Perhaps a later patch has fixed it. It hasn't happened for a couple of months (and we've had some network outages), so hopefully the problem has gone away.