Running image with aws ecs throws 504 Gateway Time-out - spring

I dockerized my application. If I run it with docker run, everything works fine.
I tried to run it with ECS Fargate and put an ALB in front of it.
If I try to access my application via the ALB DNS name, I get a 504 Gateway Time-out back.
While searching for a solution, I found a post that told me to set the Tomcat timeout higher than the ELB timeout, but that didn't help.
Dockerfile
FROM tomcat:8.0.20-jre8
RUN sed -i 's/connectionTimeout="20000"/connectionTimeout="70000"/' /usr/local/tomcat/conf/server.xml
CMD ["catalina.sh","run"]
COPY /target/Webshop.war /usr/local/tomcat/webapps/
ELB Log
http 2019-09-11T11:20:50.585293Z app/Doces-Backe-19RQJLVNHYG2P/8fb4f4079bb6ff9f 66.85.6.136:47767 - -1 -1 -1 503 - 18 348 "GET http://:8080/ HTTP/1.0" "-" - - arn:aws:elasticloadbalancing:eu-central-1:573575081005:targetgroup/ecs-Docest-de-webshop/8df4f0978484f8bd "Root=1-5d78d892-58886d3490906f0fa3914563" "-" "-" 0 2019-09-11T11:20:50.462000Z "forward" "-" "-"
http 2019-09-11T11:23:23.535869Z app/Doces-Backe-19RQJLVNHYG2P/8fb4f4079bb6ff9f 66.85.6.136:50950 10.10.11.140:8080 -1 -1 -1 504 - 18 303 "GET http://:8080/ HTTP/1.0" "-" - - arn:aws:elasticloadbalancing:eu-central-1:573575081005:targetgroup/ecs-Docest-de-webshop/8df4f0978484f8bd "Root=1-5d78d921-a236121716bd1bd209625fd8" "-" "-" 0 2019-09-11T11:23:13.415000Z "forward" "-" "-"
http 2019-09-11T11:23:56.286426Z app/Doces-Backe-19RQJLVNHYG2P/8fb4f4079bb6ff9f 66.85.6.136:51658 10.10.11.140:8080 -1 -1 -1 504 - 18 303 "GET http://:8080/ HTTP/1.0" "-" - - arn:aws:elasticloadbalancing:eu-central-1:573575081005:targetgroup/ecs-Docest-de-webshop/8df4f0978484f8bd "Root=1-5d78d942-22a1680464884762e02ec940" "-" "-" 0 2019-09-11T11:23:46.156000Z "forward" "-" "-"
http 2019-09-11T11:23:27.513803Z app/Doces-Backe-19RQJLVNHYG2P/8fb4f4079bb6ff9f 66.85.6.136:51034 10.10.11.140:8080 -1 -1 -1 504 - 18 303 "GET http://:8080/ HTTP/1.0" "-" - - arn:aws:elasticloadbalancing:eu-central-1:573575081005:targetgroup/ecs-Docest-de-webshop/8df4f0978484f8bd "Root=1-5d78d925-b6b5daf0d0f733140aea0f84" "-" "-" 0 2019-09-11T11:23:17.393000Z "forward" "-" "-"
I expected to see my application running behind the ALB.
Thanks for your help!

Solution:
The problem was that I had opened the correct port in the security group of the load balancer, but not in the security group of the ECS service.
So I opened the required port there, and now it works.
Procedure:
Go to your cluster
Go to the service with the problem
Click on the security group under Network Access and open the required port (for a CLI equivalent, see the sketch below)
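If you prefer the CLI, the same change can be made with aws ec2 authorize-security-group-ingress. This is only a sketch: the security group IDs are placeholders, and it assumes the container listens on port 8080 as in the question.
# allow the ALB's security group to reach the container port in the service's security group
aws ec2 authorize-security-group-ingress \
    --group-id sg-SERVICE-SG-ID \
    --protocol tcp \
    --port 8080 \
    --source-group sg-ALB-SG-ID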
Thanks!

There can be multiple reasons behind a gateway timeout. The only thing that I do not like about Fargate is debugging: the AWS team should enable log configuration for the Fargate service by default, as it is hard to debug these issues without logs.
It is better to configure a log driver and push the logs to CloudWatch so you can see the actual issue. Also double-check the desired container port in the task definition and the port mapped in the service (a port-mapping sketch follows after the log configuration below).
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "awslogs-spring",
"awslogs-region": "us-west-2",
"awslogs-stream-prefix": "awslogs-example"
}
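For the port check mentioned above, the relevant part of the task definition would look roughly like this. It is only a sketch assuming the Tomcat default port 8080; on Fargate (awsvpc networking), hostPort must either match containerPort or be omitted, and containerPort is the port the ECS service registers with the ALB target group.
"portMappings": [
    {
        "containerPort": 8080,
        "hostPort": 8080,
        "protocol": "tcp"
    }
]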
The log driver can also be configured from the AWS console.
You need to assign a role with CloudWatch Logs permissions to the task definition or service so that the logs can be pushed to CloudWatch.
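A minimal sketch of the log permissions such a role needs (the log-group name matches the example above; the region and resource scoping are assumptions you should adjust for your setup):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:us-west-2:*:log-group:awslogs-spring:*"
        }
    ]
}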
Once logging is configured, go to the CloudWatch log group and search it to get insight into your application.
Still, to troubleshoot the actual issue, you first have to understand the error code and the possible reasons for a Gateway Timeout.
HTTP 504: Gateway Timeout
Description: Indicates that the load balancer closed a connection because a request did not complete within the idle timeout period.
Cause 1: The application takes longer to respond than the configured idle timeout.
Solution 1: Monitor the HTTPCode_ELB_5XX and Latency metrics. If there is an increase in these metrics, it could be due to the application not responding within the idle timeout period. For details about the requests that are timing out, enable access logs on the load balancer and review the 504 response codes in the logs that are generated by Elastic Load Balancing. If necessary, you can increase your capacity or increase the configured idle timeout so that lengthy operations (such as uploading a large file) can complete. For more information, see Configure the Idle Connection Timeout for Your Classic Load Balancer and How do I troubleshoot Elastic Load Balancing high latency.
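For an Application Load Balancer like the one in the question, the idle timeout can also be raised via the CLI. A sketch, with the load balancer ARN left as a placeholder and 120 seconds chosen purely as an example:
aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn <your-alb-arn> \
    --attributes Key=idle_timeout.timeout_seconds,Value=120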
Cause 2: Registered instances closing the connection to Elastic Load Balancing.
Solution 2: Enable keep-alive settings on your EC2 instances and make sure that the keep-alive timeout is greater than the idle timeout settings of your load balancer.
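For the Tomcat image from the question, the keep-alive settings live in the HTTP Connector in server.xml. A minimal sketch, assuming the default 60-second ALB idle timeout (the 70000 ms values are examples, not requirements):
<!-- conf/server.xml: keep idle connections open longer than the ALB idle timeout -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="70000"
           keepAliveTimeout="70000"
           redirectPort="8443" />
If keepAliveTimeout is not set, Tomcat falls back to connectionTimeout, which is exactly what the sed line in the question's Dockerfile raises.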

Related

Java Spring Webflux - IP in access log

A small question regarding how to interpret the IP in the access log of a Spring WebFlux application, please.
I have a very simple Spring WebFlux application with the access log enabled; I can see the log, happy.
This application was built for only one client, meaning I am sure that, at any time, there is one and only one client application calling in.
Moreover, I know for a fact that the client application is also unique: it is one client application that rarely restarts, there is only one instance of it, deployed in a fixed physical datacenter, inside one physical hardware server. TL;DR, there is only one client.
Yet, in my access log, each time this unique client calls in, I see something like:
INFO [myservice,,] 10 --- [or-http-epoll-4] reactor.netty.http.server.AccessLog : aa.1aa.1aa.aa - - "POST /myapi HTTP/1.1" 200 563 382
INFO [myservice,,] 10 --- [or-http-epoll-2] reactor.netty.http.server.AccessLog : bb.2bb.2bb.bb - - "POST /myapi HTTP/1.1" 200 563 372
And this is confusing me: why would I see multiple IPs in this log?
Thank you

Some Postgres connections timing out while others don't

I have an AWS EC2 machine running a Laravel 5.2 application that connects to a Postgres 9.6 database running in RDS. While most of the connections work, some of them are rejected when being established, which causes a timeout and consequently an error in my API. I don't know what is causing them to be rejected. It is also very random when it happens; when it does, it may be in any API endpoint, and inside the endpoint in any query.
When the timeout is handled by PHP, it shows a message like:
SQLSTATE[08006] [7] timeout expired (SQL: ...)
Sometimes Nginx handles the timeout and replies with a 504 error. When Nginx handles the timeout, I get an error like:
2019/04/24 09:48:18 [error] 20657#20657: *3236 upstream timed out (110: Connection timed out) while reading response header from upstream, client: {client-ip-here}, server: {my-url-here}, request: "GET {my-endpoint-here} HTTP/2.0", upstream: "fastcgi://unix:/var/run/php/php7.0-fpm.sock", host: "{}", referrer: "https://app.cartoriovirtual.com/"
All usage charts on RDS and EC2 seem OK; I have plenty of RAM, storage, CPU and available connections for RDS. I also checked the inner VPC flows and they seem all right; however, I have many IPs (listed as attackers) scanning my network interfaces, most of them being rejected. Some (to port 22) are accepted but stopped at authentication; I use a .pem key file for auth.
The RDS network interface only accepts requests from machines inside the VPC. In its logs, there is a checkpoint like this every 5 minutes:
2019-04-25 01:05:29 UTC::#:[22595]:LOG: checkpoint starting: time
2019-04-25 01:05:34 UTC::#:[22595]:LOG: checkpoint complete: wrote 43 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=4.393 s, sync=0.001 s, total=4.404 s; sync files=19, longest=0.001 s, average=0.000 s; distance=16515 kB, estimate=16515 kB
Does anyone have tips on how to find a solution? I looked at all possible logs that came to mind and fixed a few little issues, but the error persists. I am running out of ideas.

opendj 3.0 replication failed to start for about 2m entries

I'm testing OpenDJ 3.0 replication.
I have two OpenDJ nodes which are replicas. The replication works nicely.
But after I added about 2M entries, one OpenDJ node failed to restart. I tried several times, but no luck. According to server.out, it looks like something timed out; I'm not sure if it's related.
Any idea or workaround? I followed https://forum.forgerock.com/topic/replication-server-timed-out-waiting-for-monitor-data/ and changed the monitor data timeout from 5 seconds to 60 seconds, but still no luck.
[03/Aug/2017:04:44:20 -0400] category=PLUGGABLE severity=NOTICE msgID=org.opends.messages.backend.513 msg=The database backend userRoot containing 2075308 entries has started
[03/Aug/2017:04:44:21 -0400] category=EXTENSIONS severity=NOTICE msgID=org.opends.messages.extension.221 msg=DIGEST-MD5 SASL mechanism using a server fully qualified domain name of: stg2-n6.nscloud.local
[03/Aug/2017:04:44:22 -0400] category=SYNC severity=NOTICE msgID=org.opends.messages.replication.204 msg=Replication server RS(31748) started listening for new connections on address 0.0.0.0 port 8989
[03/Aug/2017:04:44:23 -0400] category=SYNC severity=NOTICE msgID=org.opends.messages.replication.62 msg=Directory server DS(27712) has connected to replication server RS(31748) for domain "cn=admin data" at stg2-n6.nscloud.local/192.168.30.46:8989 with generation ID 161237
[03/Aug/2017:04:45:23 -0400] category=SYNC severity=WARNING msgID=org.opends.messages.replication.106 msg=Timed out waiting for monitor data for the domain "cn=schema" from replication server RS(19987)
[03/Aug/2017:04:46:23 -0400] category=SYNC severity=WARNING msgID=org.opends.messages.replication.106 msg=Timed out waiting for monitor data for the domain "dc=example,dc=com" from replication server RS(19987)
[03/Aug/2017:04:46:23 -0400] category=SYNC severity=WARNING msgID=org.opends.messages.replication.106 msg=Timed out waiting for monitor data for the domain "cn=admin data" from replication server RS(19987)

Gatling scenario appears to have network or HTTP corruption at high concurrent load on Windows

I have a very simple Gatling scenario which hits a single HTTP endpoint with concurrent users.
When I run this for 30 seconds with 10 requests per second, everything is fine.
When I run this for 30 seconds at 60 requests per second on Windows, I get very strange errors that look to me like the underlying network connections are getting corrupted or are being misused. Perhaps there is a race condition or concurrency bug somewhere in Gatling or somewhere else in my system.
I don't get the same problems on a linux machine.
The web server is nginx and PHP. I don't suspect that is the cause of the problem, but I might be wrong.
How can I track down and fix this bug?
The scenario code
val scn = scenario("my scenario - one endpoint only")
  .exec(http("fetch")
    .get("http://my.website/page"))
  .inject(
    constantUsersPerSec(requestsPerSecond)
      .during(30.seconds)
      .randomized)
  .protocols(httpProtocol)

setUp(scn)
Symptoms
The scenario reports about 8% failure rate, with errors that look like the server is replying with malformed HTTP responses, returning HTML code where the HTTP status line should be. These vary in the details, but here is a representative example:
2017-02-20 17:30:59,875 DEBUG org.asynchttpclient.netty.request.NettyRequestSender - invalid version format: <META
java.lang.IllegalArgumentException: invalid version format: <META
at io.netty.handler.codec.http.HttpVersion.<init>(HttpVersion.java:130)
at io.netty.handler.codec.http.HttpVersion.valueOf(HttpVersion.java:84)
at io.netty.handler.codec.http.HttpResponseDecoder.createMessage(HttpResponseDecoder.java:118)
at io.netty.handler.codec.http.HttpObjectDecoder.decode(HttpObjectDecoder.java:219)
at io.netty.handler.codec.http.HttpClientCodec$Decoder.decode(HttpClientCodec.java:152)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:250)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Similarly, the server logs include invalid requests where the client has sent HTML where the HTTP request line should be:
10.56.4.130 - - [20/Feb/2017:17:30:59 +0000] "span class=\x22id4-cta-size-small id5-cta id4-cta-color-blue id4-cta-small-blue\x22><a hr" 400 166 "-" "-" "-" 0.179 -
10.56.4.130 - - [20/Feb/2017:17:31:00 +0000] "<!doctype html>" 400 166 "-" "-" "-" 0.070 -
Version info
I am using:
O/S: Windows 8.1 64 bit
Virus Scanner: I have Kaspersky, which intercepts network traffic. I tried turning it off, which made no difference. I don't know if it was "really" off.
VPN: My machine has a Windows Direct Connect VPN. The target site does not fall within that VPN.
Java: "1.8.0_121", Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Scala: 2.11.8
Gatling: 2.2.3
Akka: 2.4.12
io.netty.netty-handler: 4.0.42.Final
(4.0.41.Final was requested by netty-reactive-streams v 1.0.8, I wonder if that's significant)

Problem with Hyperic monitoring on CloudFoundry - frequent alerts

I'm running a single-instance CloudFoundry configuration with one web application. I turned on Hyperic monitoring with notifications in case of web app unavailability.
Now I randomly receive alert emails (subject "An alert has been triggered - Deployment myapp - context unavailable") saying that the application is not running, but it is obviously running fine.
In access log of Apache I see two requests every 15 seconds:
127.0.0.1 - - [17/Mar/2010:15:37:33 +0100] "GET /server-status?auto HTTP/1.1" 200 438 "-" "Jakarta Commons-HttpClient/3.1"
127.0.0.1 - - [17/Mar/2010:15:37:33 +0100] "GET /myapp HTTP/1.1" 200 - "-" "Jakarta Commons-HttpClient/3.1"
At the time I get the alert emails, everything in the log still seems to be fine - two requests.
Do you have an idea what could be wrong? Has anybody had this kind of problem and solved it?
Thanks,
P
OK, I got info from the CloudFoundry guys. The alerts are sent if either the Apache or the internal Tomcat request goes wrong or times out. My problem apparently came from internal Tomcat requests, which are not logged in the access log.
They have now simply changed the algorithm, so the alert is triggered only when unavailability (Apache / Tomcat) is reported at least twice in a row. The frequent alert emails problem is gone.
