Problem with Hyperic monitoring on CloudFoundry - frequent alerts

I'm running a single-instance CloudFoundry configuration with one web application. I turned on Hyperic monitoring with notifications for when the web app becomes unavailable.
Now I randomly receive alert emails (subject "An alert has been triggered - Deployment myapp - context unavailable") saying that the application is not running, even though it is clearly running fine.
In the Apache access log I see two requests every 15 seconds:
127.0.0.1 - - [17/Mar/2010:15:37:33 +0100] "GET /server-status?auto HTTP/1.1" 200 438 "-" "Jakarta Commons-HttpClient/3.1"
127.0.0.1 - - [17/Mar/2010:15:37:33 +0100] "GET /myapp HTTP/1.1" 200 - "-" "Jakarta Commons-HttpClient/3.1"
At the times when I get the alert emails, everything in the log still looks fine: the same two requests.
Do you have any idea what could be wrong? Has anybody had this kind of problem and solved it?
Thanks,
P

OK, I got some information from the CloudFoundry team. An alert is sent if either the Apache request or the internal Tomcat request fails or times out. My problem apparently came from the internal Tomcat requests, which are not logged in the access log.
They have now changed the algorithm so that an alert is triggered only when unavailability (Apache or Tomcat) is reported at least two times in a row. The problem with frequent alert emails is gone.
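The "two times in a row" rule can be illustrated with a small Python sketch. This is not Hyperic's or CloudFoundry's actual implementation; the class name, the `record` method, and the sample check results are made up for illustration, and only the threshold of two comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureAlert:
    """Raise an alert only after `threshold` failed checks in a row."""
    threshold: int = 2   # "at least two times in a row", per the post
    failures: int = 0    # length of the current run of failed checks

    def record(self, available: bool) -> bool:
        """Record one check result; return True if an alert should fire now."""
        if available:
            self.failures = 0          # any successful check resets the run
            return False
        self.failures += 1
        # Fire once, exactly when the run reaches the threshold.
        return self.failures == self.threshold


if __name__ == "__main__":
    alert = ConsecutiveFailureAlert()
    # Hypothetical check results: one isolated timeout, then a longer outage.
    results = [True, False, True, False, False, False, True]
    print([alert.record(ok) for ok in results])
```

With this rule a single isolated timeout never produces an email, while a sustained outage still produces exactly one alert.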

Related

Java Spring Webflux - IP in access log

A small question about how to interpret the IP address in the access log of a Spring WebFlux application.
I have a very simple Spring WebFlux application with the access log enabled, and I can see the log entries.
This application was built for only one client: I am sure that, at any time, there is one and only one client application calling in.
Moreover, I know for a fact that the client application is also unique. It is one client application that rarely restarts, there is only one instance of it, and it is deployed in a fixed physical datacenter on one physical server. TL;DR: there is only one client.
Yet, in my access log, each time this unique client calls in, I see something like:
INFO [myservice,,] 10 --- [or-http-epoll-4] reactor.netty.http.server.AccessLog : aa.1aa.1aa.aa - - "POST /myapi HTTP/1.1" 200 563 382
INFO [myservice,,] 10 --- [or-http-epoll-2] reactor.netty.http.server.AccessLog : bb.2bb.2bb.bb - - "POST /myapi HTTP/1.1" 200 563 372
This confuses me: why would I see multiple IPs in this log?
Thank you

Running image with aws ecs throws 504 Gateway Time-out

I dockerized my application. If I run it with docker run, everything works fine.
I tried to run it on ECS Fargate and put an ALB in front of it.
If I try to access my application via the ALB DNS name, I get a 504 Gateway Time-out back.
While searching for a solution, I found a post that told me to set the Tomcat timeout higher than the ELB timeout, but it didn't help.
Dockerfile
FROM tomcat:8.0.20-jre8
RUN sed -i 's/connectionTimeout="20000"/connectionTimeout="70000"/' /usr/local/tomcat/conf/server.xml
CMD ["catalina.sh","run"]
COPY /target/Webshop.war /usr/local/tomcat/webapps/
ELB Log
http 2019-09-11T11:20:50.585293Z app/Doces-Backe-19RQJLVNHYG2P/8fb4f4079bb6ff9f 66.85.6.136:47767 - -1 -1 -1 503 - 18 348 "GET http://:8080/ HTTP/1.0" "-" - - arn:aws:elasticloadbalancing:eu-central-1:573575081005:targetgroup/ecs-Docest-de-webshop/8df4f0978484f8bd "Root=1-5d78d892-58886d3490906f0fa3914563" "-" "-" 0 2019-09-11T11:20:50.462000Z "forward" "-" "-"
http 2019-09-11T11:23:23.535869Z app/Doces-Backe-19RQJLVNHYG2P/8fb4f4079bb6ff9f 66.85.6.136:50950 10.10.11.140:8080 -1 -1 -1 504 - 18 303 "GET http://:8080/ HTTP/1.0" "-" - - arn:aws:elasticloadbalancing:eu-central-1:573575081005:targetgroup/ecs-Docest-de-webshop/8df4f0978484f8bd "Root=1-5d78d921-a236121716bd1bd209625fd8" "-" "-" 0 2019-09-11T11:23:13.415000Z "forward" "-" "-"
http 2019-09-11T11:23:56.286426Z app/Doces-Backe-19RQJLVNHYG2P/8fb4f4079bb6ff9f 66.85.6.136:51658 10.10.11.140:8080 -1 -1 -1 504 - 18 303 "GET http://:8080/ HTTP/1.0" "-" - - arn:aws:elasticloadbalancing:eu-central-1:573575081005:targetgroup/ecs-Docest-de-webshop/8df4f0978484f8bd "Root=1-5d78d942-22a1680464884762e02ec940" "-" "-" 0 2019-09-11T11:23:46.156000Z "forward" "-" "-"
http 2019-09-11T11:23:27.513803Z app/Doces-Backe-19RQJLVNHYG2P/8fb4f4079bb6ff9f 66.85.6.136:51034 10.10.11.140:8080 -1 -1 -1 504 - 18 303 "GET http://:8080/ HTTP/1.0" "-" - - arn:aws:elasticloadbalancing:eu-central-1:573575081005:targetgroup/ecs-Docest-de-webshop/8df4f0978484f8bd "Root=1-5d78d925-b6b5daf0d0f733140aea0f84" "-" "-" 0 2019-09-11T11:23:17.393000Z "forward" "-" "-"
I expected to reach my running application through the ELB.
Thanks for your help!
Solution:
The problem was that I set the correct port in the security group of the load balancer, but not in that of the ECS service.
So I opened the required port there and now it works.
Procedure:
Go to your cluster
Go to the service with the problem
Click on the Security Group under the item Network Access and open the required port
Thanks!
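One way to confirm that the port is now open is a plain TCP connection test, sketched below in Python. This is a minimal sketch under assumptions: the function name is made up, the address is the target taken from the ELB log in the question, and the check only makes sense when run from a machine inside the same VPC. A port blocked by a security group would typically show up as a timeout rather than a refused connection.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved host names.
        return False


if __name__ == "__main__":
    # 10.10.11.140:8080 is the target address from the ELB log in the question;
    # replace it with your own task's private IP and container port.
    print(port_is_reachable("10.10.11.140", 8080))
```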
There can be multiple reasons behind a gateway timeout. The only thing I do not like about Fargate is the lack of debug logs. The AWS team should enable log configuration for Fargate services by default, as it is hard to debug these issues without logs.
It is better to configure a log driver, push the logs to CloudWatch, and see the actual issue there. Also double-check the desired port in the task definition and the mapped port in the service.
"logConfiguration": {
    "logDriver": "awslogs",
    "options": {
        "awslogs-group": "awslogs-spring",
        "awslogs-region": "us-west-2",
        "awslogs-stream-prefix": "awslogs-example"
    }
}
or configure it from the AWS console.
You need to assign a permission or role for CloudWatch Logs to the task definition or service so that it can push logs to CloudWatch.
Once logs are configured, go to CloudWatch, find the log group, and you will get insight into your application.
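To show where the logConfiguration fragment sits, here is a hedged Python sketch that builds a container definition and prints it as JSON. Only the logConfiguration values come from this answer; the container name, the image reference, and the choice of port 8080 (the port seen in the ELB log in the question) are assumptions for illustration.

```python
import json

# Hypothetical container definition; only "logConfiguration" is from the post.
container_definition = {
    "name": "webshop",                   # assumed container name
    "image": "example/webshop:latest",   # assumed image reference
    "portMappings": [
        # Port taken from the ELB log in the question; check it matches
        # the port your application actually listens on.
        {"containerPort": 8080, "protocol": "tcp"}
    ],
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "awslogs-spring",
            "awslogs-region": "us-west-2",
            "awslogs-stream-prefix": "awslogs-example",
        },
    },
}

# logConfiguration belongs to an entry of containerDefinitions in the task definition.
task_definition_fragment = {"containerDefinitions": [container_definition]}
print(json.dumps(task_definition_fragment, indent=2))
```

Keeping the port mapping and the log configuration side by side makes it easier to double-check both in one place.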
But still, to troubleshoot the actual issue, you first have to understand the error code and the possible reasons for a gateway timeout.
HTTP 504: Gateway Timeout
Description: Indicates that the load balancer closed a connection because a request did not complete within the idle timeout period.
Cause 1: The application takes longer to respond than the configured idle timeout.
Solution 1: Monitor the HTTPCode_ELB_5XX and Latency metrics. If there
is an increase in these metrics, it could be due to the application
not responding within the idle timeout period. For details about the
requests that are timing out, enable access logs on the load balancer
and review the 504 response codes in the logs that are generated by
Elastic Load Balancing. If necessary, you can increase your capacity
or increase the configured idle timeout so that lengthy operations
(such as uploading a large file) can complete. For more information,
see Configure the Idle Connection Timeout for Your Classic Load
Balancer and How do I troubleshoot Elastic Load Balancing high
latency.
Cause 2: Registered instances closing the connection to Elastic Load Balancing.
Solution 2: Enable keep-alive settings on your EC2 instances and make
sure that the keep-alive timeout is greater than the idle timeout
settings of your load balancer.
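Reviewing the 5xx entries in the load balancer access logs, as suggested under Solution 1, can be sketched in Python. This is a minimal sketch, assuming the Application Load Balancer access log layout whose leading fields are listed in FIELDS; the function names are illustrative and not part of any AWS tool, and the sample lines are the entries from the question truncated after the request field.

```python
import shlex
from collections import Counter

# Leading fields of an Application Load Balancer access log entry, in order.
FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request",
]


def parse_alb_line(line: str) -> dict:
    """Split one log line (quoted fields stay intact) and name the leading fields."""
    return dict(zip(FIELDS, shlex.split(line)))


def count_gateway_errors(lines) -> Counter:
    """Count (elb_status_code, target) pairs for 5xx responses from the load balancer."""
    counts = Counter()
    for line in lines:
        entry = parse_alb_line(line)
        if entry.get("elb_status_code", "").startswith("5"):
            counts[(entry["elb_status_code"], entry["target"])] += 1
    return counts


if __name__ == "__main__":
    # Two entries from the question, truncated after the request field.
    sample = [
        'http 2019-09-11T11:20:50.585293Z app/Doces-Backe-19RQJLVNHYG2P/8fb4f4079bb6ff9f '
        '66.85.6.136:47767 - -1 -1 -1 503 - 18 348 "GET http://:8080/ HTTP/1.0"',
        'http 2019-09-11T11:23:23.535869Z app/Doces-Backe-19RQJLVNHYG2P/8fb4f4079bb6ff9f '
        '66.85.6.136:50950 10.10.11.140:8080 -1 -1 -1 504 - 18 303 "GET http://:8080/ HTTP/1.0"',
    ]
    for key, n in count_gateway_errors(sample).items():
        print(key, n)
```

In the entries from the question, target_status_code is "-" and the processing times are -1, which is consistent with the target never returning a response to the load balancer.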

Gatling scenario appears to have network or HTTP corruption at high concurrent load on Windows

I have a very simple Gatling scenario which hits a single HTTP endpoint with concurrent users.
When I run this for 30 seconds with 10 requests per second, everything is fine.
When I run this for 30 seconds at 60 requests per second on Windows, I get very strange errors that look to me like the underlying network connections are getting corrupted or are being misused. Perhaps there is a race condition or concurrency bug somewhere in Gatling or somewhere else in my system.
I don't get the same problems on a Linux machine.
The web server is nginx and PHP. I don't suspect that is the cause of the problem, but I might be wrong.
How can I track down and fix this bug?
The scenario code
val scn = scenario("my scenario - one endpoint only")
  .exec(http("fetch")
    .get("http://my.website/page"))
  .inject(
    constantUsersPerSec(requestsPerSecond)
      .during(30.seconds)
      .randomized)
  .protocols(httpProtocol)
setUp(scn)
Symptoms
The scenario reports about 8% failure rate, with errors that look like the server is replying with malformed HTTP responses, returning HTML code where the HTTP status line should be. These vary in the details, but here is a representative example:
2017-02-20 17:30:59,875 DEBUG org.asynchttpclient.netty.request.NettyRequestSender - invalid version format: <META
java.lang.IllegalArgumentException: invalid version format: <META
at io.netty.handler.codec.http.HttpVersion.<init>(HttpVersion.java:130)
at io.netty.handler.codec.http.HttpVersion.valueOf(HttpVersion.java:84)
at io.netty.handler.codec.http.HttpResponseDecoder.createMessage(HttpResponseDecoder.java:118)
at io.netty.handler.codec.http.HttpObjectDecoder.decode(HttpObjectDecoder.java:219)
at io.netty.handler.codec.http.HttpClientCodec$Decoder.decode(HttpClientCodec.java:152)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:250)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Similarly, the server logs include invalid requests where the client has sent HTML where the HTTP request line should be:
10.56.4.130 - - [20/Feb/2017:17:30:59 +0000] "span class=\x22id4-cta-size-small id5-cta id4-cta-color-blue id4-cta-small-blue\x22><a hr" 400 166 "-" "-" "-" 0.179 -
10.56.4.130 - - [20/Feb/2017:17:31:00 +0000] "<!doctype html>" 400 166 "-" "-" "-" 0.070 -
Version info
I am using:
O/S: Windows 8.1 64 bit
Virus Scanner: I have Kaspersky, which intercepts network traffic. I tried turning it off, which made no difference. I don't know if it was "really" off.
VPN: My machine has a Windows Direct Connect VPN. The target site does not fall within that VPN.
Java: "1.8.0_121", Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Scala: 2.11.8
Gatling: 2.2.3
Akka: 2.4.12
io.netty.netty-handler: 4.0.42.Final
(4.0.41.Final was requested by netty-reactive-streams v 1.0.8, I wonder if that's significant)

IIS7.5 session hanging on local development machine

Summary
Sessions within my local IIS7.5 stop responding for no obvious reason.
Details
I'm developing ASP.NET 2.0 web applications using Visual Studio 2010 on a Windows 7 Ultimate 32-bit machine (which is a VMware instance running in VMware Workstation).
For no obvious reason, IIS just appears to stop working for the current session. If I restart the browser, it works... for a short time, and then stops again. If I open a different browser (while the first one is hanging) the new one works... for a short time.
Restarting IIS works (for a short time) or rebuilding the application (for a short time) - but there is absolutely no pattern to when it stops working... and it's driving me insane!!
There is no high-CPU-usage during this time, nor any high-memory-usage.
Nor does it appear to be browser-specific: I generally use Firefox for development, but this also happens in Chrome and IE. Nor is it limited to this machine; it also happens when I test the website in old browsers running in other virtual instances.
I'm not sure when this started happening, so I am unable to say what (if anything) had changed at the time.
Can anybody suggest any reason why this might be happening?
UPDATE
This is now driving me insane - so I've been doing more investigation.
Here is a screen-shot of FireBug which is showing that the actual .aspx request is completing correctly, but for some reason IIS is simply not responding to the request for all the files within the page. The files are definitely there and have been served by IIS many, many times.
I have turned on the logs for IIS, and the only requests it has logged are those that show as successful in FireBug... those in red are missing.
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip sc-status sc-substatus sc-win32-status time-taken
2013-02-06 11:00:40 127.0.0.1 GET /default.aspx - 80 superuser 127.0.0.1 200 0 0 15
2013-02-06 11:00:40 127.0.0.1 GET /Org/Layout/Css/v0/FrontGeneral.css - 80 - 127.0.0.1 200 0 0 15
2013-02-06 11:00:40 127.0.0.1 GET /WebResource.axd d=IJ9YYVsWm9qkk8kUYcn2sYcQLbYErTn4We9MkwgF6JGUiPeoRWMmAKKsi_AbjNJQ-Je-l4D-1zuU66SBZi_kDHe1u7c1&t=634604425351482412 80 superuser 127.0.0.1 200 0 0 0
2013-02-06 11:00:40 127.0.0.1 GET /Scripts/v0/DefaultButtonFix.js - 80 - 127.0.0.1 304 0 0 0
I have also turned on the "Trace Failed Requests" (using information from here) but that is not producing anything... the directory is empty
I am still testing this, but I have finally found success. Disabling my AVG virus scanner seems to clear this issue right up. If you have a virus scanner or security package, don't bother adding exceptions; just disable it completely for a while and give it a go. You can add the exceptions back in if this test proves successful.
Know how you feel. This has been driving me nuts for weeks. I have been tweaking FF and Chrome settings with no effect whatsoever.
Best of luck...

Rails logging 127.0.0.1 every 5 minutes

I have noticed in my production Rails log that exactly every 5 minutes, I have a GET request to my root url from 127.0.0.1 which apparently is my localhost.
Started GET "/" for 127.0.0.1 at 2012-07-01 14:05:03 -0500
Processing by ApplicationController#landing as */*
Rendered shared/_header.html.erb (0.9ms)
Rendered shared/_footer.html.erb (0.5ms)
Rendered application/landing.html.erb (5.7ms)
Completed 200 OK in 8ms (Views: 7.9ms)
I have never seen this in any other Rails apps. I am using New Relic, MongoDB, Nginx, and Unicorn. Can anyone tell me why this is happening or what it means?
This is most likely a monitoring application, especially since it's only checking the root path for a successful connection (i.e. HTTP 200). Have you installed any tools such as monit? What hosting provider are you using? They may monitor without you knowing.
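To check whether the requests really arrive at a fixed interval, which would support the monitoring explanation, the timestamps can be pulled out of the log. The following Python sketch assumes the "Started GET ..." line format shown in the question; the function name is made up, and apart from the first sample line the log lines are invented for illustration.

```python
import re
from datetime import datetime

# Matches lines such as: Started GET "/" for 127.0.0.1 at 2012-07-01 14:05:03 -0500
STARTED = re.compile(
    r'^Started (?P<method>\S+) "(?P<path>[^"]*)" for (?P<ip>\S+) at (?P<ts>.+)$'
)


def request_intervals(lines, ip="127.0.0.1", path="/"):
    """Return the gaps in seconds between successive requests from `ip` to `path`."""
    times = []
    for line in lines:
        m = STARTED.match(line.strip())
        if m and m["ip"] == ip and m["path"] == path:
            times.append(datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S %z"))
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    # First line is from the question; the other two are made-up examples.
    sample = [
        'Started GET "/" for 127.0.0.1 at 2012-07-01 14:05:03 -0500',
        'Started GET "/" for 127.0.0.1 at 2012-07-01 14:10:03 -0500',
        'Started GET "/" for 127.0.0.1 at 2012-07-01 14:15:03 -0500',
    ]
    print(request_intervals(sample))
```

Gaps that are all close to the same value point to a scheduled checker rather than a person or a crawler.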
