Syslog Drain created in PCF throwing error - elasticsearch

Hi, we have created a syslog drain from PCF to Logstash, but sometimes we get this error:
2018-07-19T15:09:53.524+05:30 [LGR/] [ERR] Syslog Drain: Error when writing. Backing off for 4ms.
What does this mean, and why does it happen?

I suspect it's a communication problem: the logging system in your Cloud Foundry platform is failing as it tries to talk to your Logstash. The message doesn't give you the exact error though. To find that, you would need to be a platform operator and look at the Loggregator logs to see why it's failing. If you're not the CF platform operator, reach out to your operator for assistance.
When you see errors like this, I would suggest checking two things:
How often do you see this message?
How large does the number in "Backing off for XXms." get?
When an error occurs while sending logs, the platform backs off, and as errors continue to occur the backoff timeout grows. If you see a large value in the backoff timeout, you have a prolonged problem: for example, the log drain is configured incorrectly, your Logstash server is down, or the network to it is down. If you see the errors frequently but the number stays low, the sends are only failing intermittently (some logs go through, some don't), which could point to a flaky network connection that is up and down a lot.
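To make that concrete, here is a minimal sketch in Java, not the actual Loggregator code, of how a doubling backoff behaves: the delay resets on success, so intermittent failures keep the logged number small, while a persistent outage lets it grow.

import java.util.concurrent.ThreadLocalRandom;

public class BackoffSketch {
    public static void main(String[] args) throws InterruptedException {
        long backoffMs = 1;                // assumed starting delay
        final long maxBackoffMs = 60_000;  // assumed cap
        for (int attempt = 0; attempt < 20; attempt++) {
            if (tryToWrite()) {
                backoffMs = 1;             // success resets the delay, so intermittent
                                           // failures keep the reported number small
            } else {
                System.out.println("Error when writing. Backing off for " + backoffMs + "ms.");
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, maxBackoffMs); // persistent failures grow it
            }
        }
    }

    // Hypothetical stand-in for writing a batch of logs to the drain;
    // here roughly half the writes "fail" at random.
    private static boolean tryToWrite() {
        return ThreadLocalRandom.current().nextBoolean();
    }
}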

Related

stuck thread in createQueueConnection

I often get a stuck thread error while trying to send a JMS message to another managed server within the domain in our production environment.
Initially we thought it might be due to load on the server, but the issue occurs randomly, even at times of low load when the system is processing well, as well as during some high-volume periods.
We have not been able to find the reason for it.
Error Information:
weblogic.jms.client.JMSConnectionFactory.createQueueConnection(JMSConnectionFactory.java:199)

How to limit Couchbase client from trying to connect to Couchbase server when it's down?

I'm trying to handle Couchbase bootstrap failure gracefully and not fail the application startup. The idea is to use "Couchbase as a service", so that if I can't connect to it, I should still be able to return a degraded response. I've been able to somewhat achieve this by using the Couchbase async API; RxJava FTW.
Problem is, when the server is down, the Couchbase Java client goes crazy and keeps trying to connect to the server; from what I see, the class that does this is ConfigEndpoint, and there's no limit to how many times it tries before giving up. This is flooding the logs with java.net.ConnectException: Connection refused errors. What I'd like is for it to try a few times and then stop.
Got any ideas that can help?
Edit:
Here's a sample app.
Steps to reproduce the problem:
svn export https://github.com/asarkar/spring/trunk/beer-demo.
From the beer-demo directory, run ./gradlew bootRun. Wait for the application to start up.
From another console, run curl -H "Accept: application/json" "http://localhost:8080/beers". The client request is going to time out due to the failure to connect to Couchbase, but the Couchbase client is going to flood the console continuously.
The reason we choose to have the client continue connecting is that Couchbase is typically deployed in high-availability clustered situations. Most people who run our SDK want it to keep trying to work. We do it pretty intelligently, I think, in that we do an exponential backoff and have tuneables so it's reasonable out of the box and can be adjusted to your environment.
As to what you're trying to do, one of the tuneables is related to retry. By adjusting the timeout value and the retry behaviour, you can keep the client referenceable by the application and simply fail fast if it can't service the request.
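As a rough illustration of those tuneables, here is a sketch assuming the Couchbase Java SDK 2.x; the timeout values and bucket name are examples only, not recommendations.

import com.couchbase.client.core.retry.FailFastRetryStrategy;
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class FastFailCouchbase {
    public static void main(String[] args) {
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                .connectTimeout(3_000)                         // example: give up on bootstrap quickly
                .kvTimeout(1_000)                              // example: short per-operation timeout
                .retryStrategy(FailFastRetryStrategy.INSTANCE) // fail operations instead of retrying internally
                .build();

        Cluster cluster = CouchbaseCluster.create(env, "localhost");
        Bucket bucket = cluster.openBucket("beer-sample");     // hypothetical bucket name
        // ... use the bucket; failed operations now surface quickly,
        // letting the application return its degraded response
    }
}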
The other option is that we do have a way to let your application know which node would handle the request (or null if the bootstrap hasn't been done), and you can use this to implement circuit-breaker-like functionality. For a future release, we're looking to add circuit breakers directly to the SDK.
All of that said, these are not the normal path, as the intent is that your Couchbase cluster is up, running and accessible most of the time. Failures trigger failovers through auto-failover, which brings things back to availability. By design, Couchbase trades off some availability for consistency of the data being accessed, with replica reads from exception handlers and other intentionally stale reads for you to buy into if you need them.
Hope that helps and glad to get any feedback on what you think we should do differently.
Solved this issue myself. The client I designed handles the following use cases:
The client startup must be resilient to CB failure/unavailability.
The client must not fail the request, but return a degraded response instead, if CB is not available.
The client must reconnect should a CB failover happen.
I've created a blog post here. I understand it's preferable to copy-paste rather than linking to an external URL, but the content is too big for an SO answer.
Start a separate thread and keep calling ping on it every 10 or 20 seconds. Once CB is down, ping will start failing; have a check like "if ping fails 5-6 times in a row, then close all the CB connections/resources".
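A rough sketch of that watchdog in plain Java is below; ping() and closeCouchbaseResources() are hypothetical placeholders for whatever health check and cleanup your Couchbase wrapper exposes, and the interval and threshold are only examples.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CouchbaseWatchdog {
    private static final int MAX_CONSECUTIVE_FAILURES = 5;
    private final AtomicInteger failures = new AtomicInteger();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            if (ping()) {
                failures.set(0);               // healthy again, reset the counter
            } else if (failures.incrementAndGet() >= MAX_CONSECUTIVE_FAILURES) {
                closeCouchbaseResources();     // stop the client from reconnect-spamming the logs
                scheduler.shutdown();
            }
        }, 0, 15, TimeUnit.SECONDS);           // every 10-20 seconds, as suggested above
    }

    private boolean ping() {
        // Hypothetical: replace with a cheap Couchbase operation or health check.
        return false;
    }

    private void closeCouchbaseResources() {
        // Hypothetical: disconnect the cluster / shut down the environment here.
    }
}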

WSO2 log4j and elasticsearch: all carbon apps freeze

I've noticed a very strange behavior in my WSO2 apps (ESB 4.9, AM 1.10 and GREG 5.0.0).
Every single time elasticsearch/logstash is stopped, all the Carbon apps freeze.
They become completely unresponsive, and the only way to stop them is to send a kill -9.
My conf is pretty standard (see below), so I was wondering if I'm missing something or if someone else has noticed the same issue.
log4j.rootLogger=INFO, CARBON_CONSOLE, CARBON_LOGFILE, CARBON_MEMORY,tcp
log4j.appender.tcp=org.apache.log4j.net.SocketAppender
log4j.appender.tcp.layout=org.wso2.carbon.utils.logging.TenantAwarePatternLayout
log4j.appender.tcp.layout.ConversionPattern=[%d] %P%5p {%c} - %x %m%n
log4j.appender.tcp.layout.TenantPattern=%U%#%D[%T]
log4j.appender.tcp.Port=6000
log4j.appender.tcp.RemoteHost=localhost
log4j.appender.tcp.ReconnectionDelay=10000
log4j.appender.tcp.threshold=DEBUG
log4j.appender.tcp.Application=esb500wso2carbon
What the documentation says:
Logging events are automatically buffered by the native TCP implementation. This means that if the link to the server is slow but still faster than the rate of (log) event production by the client, the client will not be affected by the slow network connection. However, if the network connection is slower than the rate of event production, then the client can only progress at the network rate. In particular, if the network link to the server is down, the client will be blocked.
On the other hand, if the network link is up, but the server is down, the client will not be blocked when making log requests but the log events will be lost due to server unavailability.
But in my case, even when the "server is down", the client is sometimes blocked because many Java threads are blocked on the same lock object.
Have a look at JMSAppender or AsyncAppender.
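For illustration, here is a minimal sketch of wrapping the TCP appender in an AsyncAppender with log4j 1.x. It is shown programmatically because AsyncAppender is normally wired up via log4j.xml or code rather than a .properties file; the buffer size is an example and this is not a tested Carbon configuration.

import org.apache.log4j.AsyncAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.net.SocketAppender;

public class AsyncTcpLogging {
    public static void main(String[] args) {
        SocketAppender tcp = new SocketAppender("localhost", 6000); // same target as the config above
        tcp.setReconnectionDelay(10000);

        AsyncAppender async = new AsyncAppender();
        async.setBlocking(false);  // drop events when the buffer is full instead of blocking the app
        async.setBufferSize(500);  // example size
        async.addAppender(tcp);

        Logger.getRootLogger().addAppender(async);
        Logger.getRootLogger().info("logging through the non-blocking async wrapper");
    }
}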
According to WSO2, it is a bug.
It doesn't affect version 5.x.
The suggested workaround, which I successfully tested, is to use Filebeat instead :(
Not ideal, but it works.

licenseNotification API has stopped working

Calling https://www.googleapis.com/appsmarket/v2/licenseNotification/[myAppId] has suddenly started returning 500 Backend Error every time.
A 500 error is a generic message, given when no more specific message is suitable. There are a number of possible causes for a 500 Internal Server Error.
With any error message, particularly one as broad as the 500 Internal Server Error, you will first want to check your error logs for your server. These logs can provide valuable context related to any code failures or other potential causes of a site failure.
As stated in the answer in this SO question.
In general, every Backend error should be handled with an exponential retry, as there might be service problems.
If the error still persists after, let's say, 10 hours, then you should contact support so that they can provide 1:1 help with your problem.
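As a sketch of that exponential-retry advice, something like the following could be used; the delays and attempt cap are examples, [myAppId] stays a placeholder, and the OAuth authorization header the real call needs is omitted.

import java.net.HttpURLConnection;
import java.net.URL;

public class LicenseNotificationRetry {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.googleapis.com/appsmarket/v2/licenseNotification/[myAppId]");
        long delayMs = 1_000;            // start at one second
        final long maxDelayMs = 64_000;  // cap the backoff

        for (int attempt = 1; attempt <= 8; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Real requests must also set the Authorization header; omitted here.
            int status = conn.getResponseCode();
            conn.disconnect();

            if (status < 500) {          // success, or a client error that retrying won't fix
                System.out.println("Got HTTP " + status + ", stopping retries");
                return;
            }
            System.out.println("Attempt " + attempt + " got HTTP " + status
                    + ", retrying in " + delayMs + "ms");
            Thread.sleep(delayMs);
            delayMs = Math.min(delayMs * 2, maxDelayMs);
        }
        System.out.println("Still failing after retries; contact support as suggested above");
    }
}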

Heroku, apparent silent failure of sucker_punch

My app runs on Heroku with unicorn and uses sucker_punch to send a small quantity of emails in the background without slowing the web UI. This has been working pretty well for a few weeks.
I changed the unicorn config to the Heroku recommended config. The recommended config includes an option for the number of unicorn processes, and I upped the number of processes from 2 to 3.
Apparently that was too much. The sucker_punch jobs stopped running. I have log messages that indicate when they are queued and I have messages that indicate when they start processing. The log shows them being queued but the processing never starts.
My theory is that I exceeded memory by going from 2 to 3 unicorns.
I did not find a message anywhere indicating a problem.
Q1: Should I expect to find a failure message somewhere? Something like "attempting to start sucker_punch -- oops, not enough memory"?
Q2: Any suggestions on how I can be notified of a failure like this in the future?
Thanks.
If you are indeed exceeding dyno memory, you should find R14 or R15 errors in your logs. See https://devcenter.heroku.com/articles/error-codes#r14-memory-quota-exceeded
A more likely problem, though, given that you haven't found these errors, is that something within the perform method of your sucker_punch worker is throwing an exception. I've found sucker_punch tasks to be a pain to debug because it appears the lib swallows all exceptions silently. Try instantiating your task and calling perform on it from a Rails console to make sure that it behaves as you expect.
For example, you should be able to do this without causing an exception:
task = YourTask.new
task.perform :something, 55
