Splunk logback/log4j appender does not seem to log to Splunk - AWS Lambda

I need to log to Splunk from AWS Lambda using the Java 8 runtime. The function uses the Spring framework, and I added the Logback Splunk appender to the project. There are no errors, but the logs don't show up in Splunk; the Splunk admin mentioned that no requests are received on the Splunk server. When I invoke the REST API manually, the log does show up in Splunk, so the connectivity from AWS Lambda to the Splunk server is good. The Splunk appender seems to invoke the API asynchronously, so I added a 50-second sleep at the end of the AWS Lambda code to see whether the issue is the VM exiting before the async step completes. No luck yet. How do I debug further?
Code snippet:-
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class LambdaApp implements RequestHandler<String, Object>
{
    private static final Logger LOGGER = LoggerFactory.getLogger(LambdaApp.class);
    private static final Logger SPLUNK_LOGGER = LoggerFactory.getLogger("splunk.logger");

    @Override
    public Object handleRequest(String event, Context context)
    {
        SPLUNK_LOGGER.info("AWS Lambda start");
        try {
            Thread.sleep(50000);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return "handled";
    }
}
Maven dependency:-
<dependency>
    <groupId>com.splunk.logging</groupId>
    <artifactId>splunk-library-javalogging</artifactId>
    <version>1.5.2</version>
</dependency>
Logback configuration:-
<appender name="http" class="com.splunk.logging.HttpEventCollectorLogbackAppender">
    <url>https://a.b.c.d:8088</url>
    <token>valid-token</token>
    <disableCertificateValidation>true</disableCertificateValidation>
    <layout class="ch.qos.logback.classic.PatternLayout">
        <pattern>{%msg}</pattern>
    </layout>
</appender>
<logger name="splunk.logger" level="DEBUG">
    <appender-ref ref="http" />
</logger>

The first step is to add batch_size_count to rule out any issues with the HttpEventCollectorLogbackAppender not flushing to Splunk.
<appender name="http" class="com.splunk.logging.HttpEventCollectorLogbackAppender">
    <url>https://a.b.c.d:8088</url>
    <token>valid-token</token>
    <batch_size_count>1</batch_size_count>
    <disableCertificateValidation>true</disableCertificateValidation>
    <layout class="ch.qos.logback.classic.PatternLayout">
        <pattern>{%msg}</pattern>
    </layout>
</appender>
You should also verify that you are running Splunk 6.3+ on the receiving end, since HTTP Event Collector requires a minimum of v6.3.

You have to specify a valid / existing Splunk index or the log entry will be dropped silently on the floor.
Setting batch_size_count to 1 will make sure that every log entry gets flushed. Unfortunately, events are sent asynchronously no matter how you configure the appender (even though "sequential" is the default sendMode), so flushing alone isn't sufficient to guarantee the event has actually been delivered. For that you need to set terminationTimeout to a value (in milliseconds) that is "long enough", but not so long that it significantly impacts your Lambda.
Note that terminationTimeout is applied on every flush and not just during what the library considers "termination". It is also implemented with busy polling: a thread spins every few milliseconds to check the state of another thread, which is wasteful and suggests the implications for multi-threaded code were not thought through.
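Putting those pieces together, a sketch of what the appender configuration might look like (the index name and timeout value are placeholders to adapt to your environment; the index and terminationTimeout settings are the ones discussed above):
<appender name="http" class="com.splunk.logging.HttpEventCollectorLogbackAppender">
    <url>https://a.b.c.d:8088</url>
    <token>valid-token</token>
    <index>your-existing-index</index>
    <batch_size_count>1</batch_size_count>
    <terminationTimeout>5000</terminationTimeout>
    <disableCertificateValidation>true</disableCertificateValidation>
    <layout class="ch.qos.logback.classic.PatternLayout">
        <pattern>{%msg}</pattern>
    </layout>
</appender>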

Related

Missing logs from GKE on Cloud logging

I'm having an issue with some missing logs from a GKE container in Cloud Logging.
I have a Spring Boot application deployed on GKE with Log4j2. All the logs generated by the application are always written to Cloud Logging, so if I execute 100 transactions in parallel using JMeter I can find all the logs in Cloud Logging without problems (logs at the beginning, middle and end of the REST controller).
Now I am migrating from Log4j2 to Logback to have full integration with Cloud Logging; I'm following this guide: https://docs.spring.io/spring-cloud-gcp/docs/current/reference/html/logging.html
After the migration, updating only the logging dependency from Log4j to Logback, I can still see my logs in Cloud Logging, but I'm having a weird issue with some missing logs.
For example, if I send 30 parallel transactions using JMeter I can see all the logs generated by the service; basically I'm searching for each message like this:
"This is a message "
"This is the mid of controller"
"End of trx, cleaning MDC context : "
The loggers look like this:
Logger.info("Starting transaction: ", transactionId).
Logger.info("This is the mid of controller").
Logger.info("End of trx, cleaning MDC context : ", transactionId).
MDC.clear();
return response.
I'm searching for messages generated at the start of the rest controller, some logs at the middle of the controller and logs generated at the end of the controller, just before the "return reponse."
So if I send 30 transactions in parallel using JMeter I can find all the log messages in Cloud Logging, but if I repeat the same 30 transactions 1 minute later I can find logs, but not all of them. For example I can find:
30 of "Starting transaction: ",
22 of "This is the mid of controller"
2 of "End of trx, cleaning MDC context : "
Then if I repeat:
20 of "Starting transaction: ",
0 of "This is the mid of controller"
0 of "End of trx, cleaning MDC context : "
If I wait 5 minutes and repeat:
30 of "Starting transaction: ",
30 of "This is the mid of controller"
30 of "End of trx, cleaning MDC context : "
In some cases I literally find 0 logs for a specific transaction.
In all cases the response of the service is always good: even when I can't see all the logs, I know the service is working fine because I receive a 200 success and the expected response in the body. There are also no inconsistencies in the response; everything is just working fine.
Sorry for the long intro, but now the questions.
1 - Is Cloud Logging skipping similar logs? I'm always sending the same transaction in JMeter for all the cases, so the only difference between transactions is the transactionId (generated at the beginning of the REST controller).
2 - If I send a request manually using Postman, I can find all the logs. Could Cloud Logging be skipping similar logs generated almost at the same time by parallel transactions?
I have tested the same cases locally and everything works fine; even if I send 100 transactions in parallel each second in a long loop I can find all the logs generated by the service (I'm writing the logs to a file), so I'm only having this issue in GKE.
Also, I understand that @RestController is thread safe, so I'm not seeing inconsistencies in the logs or responses.
I'm using MDC with the includeMDC configuration in Logback; basically I'm adding the transactionId to the MDC context with MDC.put("transactionId", transactionId). If I'm not wrong, MDC is also thread safe, so it should not be the problem.
My logback file looks like this.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<include resource="org/springframework/cloud/gcp/autoconfigure/logging/logback-appender.xml"/>
<include resource="org/springframework/boot/logging/logback/defaults.xml"/>
<include resource="org/springframework/boot/logging/logback/console-appender.xml"/>
<appender name="CONSOLE_JSON_APP" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
<layout class="org.springframework.cloud.gcp.logging.StackdriverJsonLayout">
<includeTraceId>true</includeTraceId>
<includeSpanId>true</includeSpanId>
<includeLevel>true</includeLevel>
<includeThreadName>true</includeThreadName>
<includeMDC>true</includeMDC>
<includeLoggerName>true</includeLoggerName>
<includeContextName>true</includeContextName>
<includeMessage>true</includeMessage>
<includeFormattedMessage>true</includeFormattedMessage>
<includeExceptionInMessage>true</includeExceptionInMessage>
<includeException>true</includeException>
<serviceContext>
<service>APP-LOG</service>
</serviceContext>
</layout>
</encoder>
</appender>
<appender name="CONSOLE_JSON_EXT" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
<layout class="org.springframework.cloud.gcp.logging.StackdriverJsonLayout">
<projectId>${projectId}</projectId>
<includeTraceId>true</includeTraceId>
<includeSpanId>true</includeSpanId>
<includeLevel>true</includeLevel>
<includeThreadName>true</includeThreadName>
<includeMDC>true</includeMDC>
<includeLoggerName>true</includeLoggerName>
<includeContextName>true</includeContextName>
<includeMessage>true</includeMessage>
<includeFormattedMessage>true</includeFormattedMessage>
<includeExceptionInMessage>true</includeExceptionInMessage>
<includeException>true</includeException>
<serviceContext>
<service>EXT-LOG</service>
</serviceContext>
</layout>
</encoder>
</appender>
<!-- Loggers-->
<root level="INFO" name="info-log">
<appender-ref ref="LOCAL_EXTERNAL_DEP"/>
</root>
<logger name="com.example.test.service" level="INFO" additivity="false">
<appender-ref ref="LOCAL_APP" />
</logger>
</configuration>
The REST controller looks like this:
@RestController
public class TestServiceController {

    @PostMapping("/evaluate")
    public Response evaluate(@RequestBody Request request) {
        UUID transactionId = UUID.randomUUID();
        Logger.info("Starting transaction: {}", transactionId);
        MDC.put("transactionId", transactionId.toString());
        // Some java code here (only simple things)
        Logger.info("This is the mid of controller");
        // Some java code here (only simple things)
        Logger.info("End of trx, cleaning MDC context : {}", transactionId);
        MDC.clear();
        return transaction.getResponse();
    }
}
At this moment my only guess is that Cloud Logging is skipping similar logs generated in a short period of time (Basically parallels executions).
Try adjusting the flushing settings. For example, set flushLevel to DEBUG. Docs about flushLevel: https://docs.spring.io/spring-cloud-gcp/docs/current/reference/html/logging.html#_log_via_api
I've seen the issue you described when applications aren't configured to flush logs directly to stdout/stderr.
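For reference, a minimal sketch of what the API-based appender from the linked guide looks like with an explicit flushLevel; the log name is a placeholder, and the element names should be checked against the guide for the version you are using:
<appender name="STACKDRIVER" class="com.google.cloud.logging.logback.LoggingAppender">
    <log>application.log</log>
    <flushLevel>DEBUG</flushLevel>
</appender>

<root level="INFO">
    <appender-ref ref="STACKDRIVER"/>
</root>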

3rd party logging not picked up by SLF4JBridgeHandler which is attached to root logger but then removed

Our app has 3rd-party code that uses JUL to log. Our app has jul-to-slf4j.jar on the classpath. I see the SLF4JBridgeHandler.install() method gets called (several times), as for whatever reason Spring cleans up the logging config a few times based on certain ApplicationEvents. I created an event listener on ApplicationStartedEvent to again make sure the handler is installed as late as possible. I verify by checking the SLF4JBridgeHandler.isInstalled() method. Yet the 3rd-party logs keep flowing to standard out. If I set a breakpoint in the 3rd-party logging class in IntelliJ and call SLF4JBridgeHandler.isInstalled(), it returns false. If I then execute the code below, the logs get picked up by SLF4J and everyone is happy.
SLF4JBridgeHandler.install();
LogManager.getLogManager().getLogger("com.3rdparty").setUseParentHandlers(false);
Why does SLF4J keep getting uninstalled? How do I keep it installed? Are the 3rd party loggers getting created before we install slf4j and then not getting updated by the install() call?
Why does SLF4J keep getting uninstalled?
The most common reason is code calling LogManager.reset(). You can find the source by using a security manager.
If you use a logging.properties file, just add the bridge as a handler. That way a reset actually installs the bridge handler.
handlers= java.util.logging.ConsoleHandler, org.slf4j.bridge.SLF4JBridgeHandler
The only downside is that the bridge handler has to be visible to the system classloader. A very hacky trick you can try is to subclass SLF4JBridgeHandler and override equals to return false. LogManager.reset won't be able to remove this handler.
Logger.getLogger("").addHandler(new SLF4JBridgeHandler() {
public boolean equals(Object o) { return false;}
});
Are the 3rd party loggers getting created before we install slf4j and then not getting updated by the install() call?
You'll have to do some work to find out that answer. Use JConsole MBeans or System.out debugging to walk the logger tree while debugging your application. The security manager tracing is your best bet for finding the exact stack trace of whoever is responsible for removing the handler.
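If it helps, here is a small sketch for walking the JUL logger tree (the class and method names are mine, not from any library); it prints the handlers attached to each logger so you can see where the bridge handler actually sits:
import java.util.Collections;
import java.util.logging.Handler;
import java.util.logging.LogManager;
import java.util.logging.Logger;

public class JulTreeDump {

    // Dumps every JUL logger that has handlers (plus the root logger) so you can
    // check whether the SLF4JBridgeHandler is still attached where you expect it.
    public static void dump() {
        LogManager manager = LogManager.getLogManager();
        for (String name : Collections.list(manager.getLoggerNames())) {
            Logger logger = manager.getLogger(name);
            if (logger == null) {
                continue; // the logger may already have been garbage collected
            }
            Handler[] handlers = logger.getHandlers();
            if (handlers.length > 0 || name.isEmpty()) {
                System.out.println("logger '" + name + "' useParentHandlers=" + logger.getUseParentHandlers());
                for (Handler h : handlers) {
                    System.out.println("  handler: " + h.getClass().getName());
                }
            }
        }
    }
}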
Can there be two root loggers somehow?
That can't happen under the standard LogManager. Tomcat has an implementation that is sensitive to the context classloader. That LogManager may return different instances of the same logger name based on context class loader. The only other way the handlers can be removed without going through the security manager is if the root logger was garbage collected. You could try to hold a strong reference to the root logger but that really shouldn't be needed. You should print the identity hash code of the root logger when the handler was known to be added and the identity hash code of the root logger after the handler was dropped. If the values are different the logger was garbage collected.
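A minimal sketch of that identity check, assuming you can add a couple of diagnostic lines at the two points in time you want to compare:
// Print this once right after installing the bridge and again when the logs stop flowing;
// different identity hash codes mean the root logger instance was replaced or collected.
java.util.logging.Logger root = java.util.logging.Logger.getLogger("");
System.out.println("root logger identity=" + System.identityHashCode(root)
        + ", handlers=" + java.util.Arrays.toString(root.getHandlers()));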

Custom HealthIndicator not invoked during startup

I implemented a custom HealthIndicator for our application, which is working fine.
I noticed that when I run the application through my IDE (tested with both IntelliJ and Eclipse), the HealthIndicator.health() method is invoked during startup.
However, when I run the application by using the JAR file itself, the HealthIndicator.health() method is not invoked during startup of the application.
Why is the HealthIndicator.health() method not invoked during startup when I run it as a JAR file, and shouldn't it behave similarly as when running it through the IDE?
This is actually not really a bug, but a side effect caused by your IDE. You should be aware that Actuator endpoints are not only exposed over HTTP, but also over JMX. If you take a look at the documentation, you'll also see that the health endpoint is enabled by default on both HTTP and JMX.
Additionally, most IDEs, including IntelliJ and Eclipse, will enable a JMX agent when running the application through the IDE itself. This means that when the application is started, a JMX connection is made, which will in turn trigger the custom health indicator.
You can verify this quite easily, for example let's assume the following health indicator:
@Bean
public HealthIndicator alwaysUpHealthIndicator() {
    return () -> {
        log.info("Indicator invoked");
        return Health.up().withDetail("Foo", "Bar").build();
    };
}
If you change your IntelliJ run configuration and disable Enable JMX agent, you'll notice that the message no longer appears in the log.
Likewise, if you disable the health JMX endpoint, you'll also notice that you won't get the additional message within the logs:
management.endpoints.jmx.exposure.exclude=health
This means that you shouldn't rely on the HealthIndicator being executed during the startup of your application. If you have code that should be executed when your application is starting up, consider using an ApplicationRunner or CommandLineRunner bean. For example:
@Bean
public ApplicationRunner applicationRunner() {
    return args -> log.info("This will be invoked on startup");
}
I can't answer the question directly, but it looks like there is no real question here; if it's a bug, submit it to the Spring Boot team. Otherwise it's just a statement that I fully agree with.
Either the HealthIndicator.health() method should be executed in both cases or in neither.
What you're describing sounds more like a weird bug. Here is a dirty way to check what happens (remove it, of course, in production):
Inside the health method of the health indicator, obtain a stack trace and print it to the console.
Analyze the stack trace and check that it's not the result of some /health invocation over HTTP (for example, maybe your IDE is configured to automatically call the actuator's health endpoint upon start, who knows).
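A throwaway sketch of such a tracing indicator (the class and bean names are mine, purely for illustration) could look like this:
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class HealthTraceConfig {

    // Temporary diagnostic only: print the call site every time health() runs,
    // so you can tell HTTP, JMX and IDE-triggered invocations apart.
    @Bean
    public HealthIndicator tracingHealthIndicator() {
        return () -> {
            new Exception("HealthIndicator.health() invoked here").printStackTrace();
            return Health.up().build();
        };
    }
}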

Readiness probe during Spring context startup

We are deploying our spring boot applications in OpenShift.
Currently we are trying to run a potentially long-running task (a database migration) before the web context is fully set up.
It is especially important that the app does not accept REST requests or process messages before the migration is fully run.
See the following minimal example:
// DemoApplication.java
@SpringBootApplication
public class DemoApplication {
    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}

// MigrationConfig.java
@Configuration
@Slf4j
public class MigrationConfig {
    @PostConstruct
    public void run() throws InterruptedException {
        log.info("Migration...");
        // long running task
        Thread.sleep(10000);
        log.info("...Migration");
    }
}

// Controller.java
@RestController
public class Controller {
    @GetMapping("/test")
    public String test() {
        return "test";
    }
}

// MessageHandler.java
@EnableBinding(Sink.class)
public class MessageHandler {
    @StreamListener(Sink.INPUT)
    public void handle(String message) {
        System.out.println("Received: " + message);
    }
}
This works fine so far: the auto configuration class is processed before the app responds to requests.
What we are worried about, however, is OpenShift's readiness probe: currently we use an actuator health endpoint to check if the application is up and running.
If the migration takes a long time, OpenShift might stop the container, potentially leaving us with inconsistent state in the database.
Does anybody have an idea how we could communicate that the application is starting, but prevent REST controller or message handlers from running?
Edit
There are multiple ways of blocking incoming REST requests; @martin-frey suggested a servlet filter.
The larger problem for us is the stream listener. We use Spring Cloud Stream to listen to a RabbitMQ queue.
I added an exemplary handler in the example above.
Do you have any suggestions on how to "pause" that?
What about a servlet filter that knows about the state of the migration? That way you should be able to handle any inbound request and return a response code to your liking. There would also be no need to prevent any request handlers from running until the system is fully up.
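A rough sketch of such a filter, assuming a Spring Boot servlet stack; the class name, the markMigrationDone() hook and the /actuator/health carve-out are all my own assumptions to adapt to your setup:
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

// Hypothetical gate: reject HTTP traffic with 503 until the migration flips the flag.
@Component
public class MigrationGateFilter extends OncePerRequestFilter {

    private final AtomicBoolean migrationDone = new AtomicBoolean(false);

    // Called by the migration task once it has finished.
    public void markMigrationDone() {
        migrationDone.set(true);
    }

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response, FilterChain chain)
            throws ServletException, IOException {
        String uri = request.getRequestURI();
        // Let the health/readiness probe through so the platform can still see the pod
        // (the /actuator/health path is an assumption; adapt it to your probe URL).
        if (!migrationDone.get() && !uri.startsWith("/actuator/health")) {
            response.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE, "Migration in progress");
            return;
        }
        chain.doFilter(request, response);
    }
}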
I think your app pod can run unaffected if you set a long enough initialDelaySeconds for the initialization of your application. [0][1]
readinessProbe:
  httpGet:
    path: /_status/healthz
    port: 8080
  initialDelaySeconds: 10120
  timeoutSeconds: 3
  periodSeconds: 30
  failureThreshold: 100
  successThreshold: 1
Additionally, I recommend setting up the liveness probe with the same condition (but with a longer delay than the readiness probe's value); then your pods can be recovered automatically if the application has failed to come up within initialDelaySeconds.
[0] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-readiness-probes
[1] https://docs.openshift.com/container-platform/latest/dev_guide/application_health.html
How about adding an init container whose only role is the DB migration, without the application?
Then another container serves the application. But be careful when deploying the application with more than 1 replica: the replicas will all execute the init container at the same time if you are using a Deployment.
If you need multiple replicas, you might want to consider StatefulSets instead.
Such database migrations are best handled by switching to a Recreate deployment strategy and doing the migration as a mid-lifecycle hook. At that point there are no instances of your application running, so it can be done safely. If you can't have downtime, then you need to be able to switch the application to some offline or read-only mode against a copy of your database while doing the migration.
Don't keep the context busy doing a long task in @PostConstruct. Instead, start the migration as a fully asynchronous task and allow Spring to build the rest of the context in the meantime. At the end of the task, just complete some shared Future with success or failure. Wrap the controller in a proxy (this can be facilitated with AOP, for example) where every method except the health check tries to get the value from that future within a timeout. If it succeeds, the migration is done and all calls are available; if not, reject the call. Your proxy serves as a gate, allowing only the part of the API that is critical to remain available while the migration is going on. The rest of it may simply respond with 503, indicating the service is not ready yet. Those 503 responses could be improved further by measuring and averaging the time the migration typically takes and returning that value in a Retry-After header.
And with the MessageHandler you can do essentially the same thing: wait for the result of the future in the handle method (provided message handlers are allowed to hang indefinitely). Once the result is set, message handling proceeds from that moment on.
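As an illustration of that gate idea (all names here are hypothetical, not from any framework), a shared component built around a CompletableFuture might look like this; the migration task calls markDone(), while the controller proxy and the message handler call awaitMigration() before doing real work:
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.springframework.stereotype.Component;

// Hypothetical shared gate between the async migration, the controller proxy and the
// message handler: the future completes once the migration has finished.
@Component
public class MigrationGate {

    private final CompletableFuture<Void> migrationDone = new CompletableFuture<>();

    // Called by the asynchronous migration task once the schema changes have been applied.
    public void markDone() {
        migrationDone.complete(null);
    }

    // Returns true if the migration finished within the timeout, false otherwise;
    // a REST call would translate "false" into a 503 with a Retry-After header.
    public boolean awaitMigration(long timeout, TimeUnit unit) throws InterruptedException {
        try {
            migrationDone.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Migration failed", e.getCause());
        }
    }
}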

JVM Debugging without a Debugger

Newish to JVM debugging.
I've worked on supporting other products based on VxWorks written in C/C++. Within that environment we were able to do symbol lookups on a live system and peek or poke memory to get an idea what the software was doing or to alter it when other "normal" configuration options weren't available.
Now I'm supporting Java applications. For issues that aren't readily reproducible in our labs, we are reduced to recompiling with additional instrumentation and performing binary replacements to gather more information about what is happening within the software.
These are always-on applications that folks don't take kindly to restarting.
Is there any similar type of debugging that can be taken for JVM applications? These are running on customer sites where using a conventional debugger is not an option for a number of reasons.
Please no lectures on how the application is poorly designed for support-ability. That's a given, we're just a couple of guys in support that have to figure it out the best we can.
thanks,
Abe
I've been in a similar situation, where stopping the application in a debugger triggered timeouts on lower layers, and adding instrumentation was annoying because of restarts.
For me the solution was to add more logging statements.
I used the slf4j API with the Logback implementation, enabled JMX on the JVM, and used it to enable/disable logging as needed and/or change which classes were logged.
If you use slf4j/logback the right way, there is very little overhead for a disabled log statement, so I was able to use them liberally.
This way I could turn on "debug instrumentation" without annoying users with restarts.
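To make the "very little overhead" point concrete, this is the pattern I mean: parameterized messages are only formatted when the level is enabled, and an explicit isDebugEnabled() guard is only needed when building the argument is itself expensive. The class below is just an illustration with made-up names:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CheapLogging {

    private static final Logger log = LoggerFactory.getLogger(CheapLogging.class);

    public static void main(String[] args) {
        int orderId = 42; // hypothetical value, for illustration only

        // Cheap when DEBUG is off: the message is never formatted.
        log.debug("processing order {}", orderId);

        // Guard only when computing the argument itself is expensive.
        if (log.isDebugEnabled()) {
            log.debug("full dump: {}", buildExpensiveDump(orderId));
        }
    }

    private static String buildExpensiveDump(int orderId) {
        return "order-" + orderId; // stands in for an expensive computation
    }
}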
Now some code:
This is a testbed for the experiment:
package pl.gov.mofnet.giif.logondemand;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogOnDemand {

    private static final Logger log = LoggerFactory.getLogger(LogOnDemand.class);

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 1000; i++) {
            log.debug("i: {}", i);
            Thread.sleep(1000);
        }
    }
}
This file needs to be put in default package (or anywhere on classpath).
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<jmxConfigurator />
<appender name="console" class="ch.qos.logback.core.ConsoleAppender">
<layout class="ch.qos.logback.classic.PatternLayout">
<Pattern>%date [%thread] %-5level %logger{25} - %msg%n</Pattern>
</layout>
</appender>
<root level="info">
<appender-ref ref="console" />
</root>
</configuration>
This code has a compile-time dependency on org.slf4j:slf4j-api:1.7.7 and a runtime dependency on ch.qos.logback:logback-classic:1.1.2.
Run this code; in Java 7 you do not have to enable JMX explicitly if you connect to JMX from the same machine. You will see that there is no output on the console except for the initial Logback configuration messages.
Start jconsole, connect to this process, and in the MBeans tab you will find a ch.qos.logback.classic node; deep below it there are operations.
Edit the parameters of setLoggerLevel: set p1 to the package name of your class, in this case pl.gov.mofnet.giif, and p2 to debug. Press the setLoggerLevel button to run the operation, and you should see log messages on the console. To disable the logging, reset the logger level to info or higher.
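The same setLoggerLevel operation can also be invoked programmatically instead of through jconsole. This is only a sketch: it targets the local platform MBeanServer (a remote JVM would need a JMXConnector), and the ObjectName assumes logback's default context name "default", so check the exact name in jconsole first:
import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;

public class LogLevelChanger {

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // ObjectName assumed from the default logback context name; verify in jconsole.
        ObjectName configurator = new ObjectName(
                "ch.qos.logback.classic:Name=default,Type=ch.qos.logback.classic.jmx.JMXConfigurator");
        // Equivalent to pressing the setLoggerLevel button with p1/p2 filled in.
        server.invoke(configurator,
                "setLoggerLevel",
                new Object[] { "pl.gov.mofnet.giif", "debug" },
                new String[] { "java.lang.String", "java.lang.String" });
    }
}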
