Marathon kills container when requests increase - Mesos

I deploy Docker containers on Mesos (0.21) and Marathon (0.7.6) on Google Compute Engine.
I use JMeter to test a REST service that runs on Marathon. When there are fewer than 10 concurrent requests it works normally, but when there are over 50 concurrent requests the container is killed and Mesos starts another one. I increased RAM and CPU, but it still happens.
This is the log in /var/log/mesos/:
E0116 09:33:31.554816 19298 slave.cpp:2344] Failed to update resources for container 10e47946-4c54-4d64-9276-0ce94af31d44 of executor dev_service.2e25332d-964f-11e4-9004-42010af05efe running task dev_service.2e25332d-964f-11e4-9004-42010af05efe on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/612/cgroup: Failed to open file '/proc/612/cgroup': No such file or directory

The error message you're seeing is actually another symptom, not the root cause of the issue. There's a good explanation/discussion in this Apache Jira bug report:
https://issues.apache.org/jira/browse/MESOS-1837
Basically, your container is crashing for one reason or another, and the /proc/<pid>/ directory is being cleared without Mesos being aware of it, so Mesos throws the error message you found when it goes to check that /proc directory.
Try setting your allocated CPU higher in the JSON file describing your task.
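For illustration, a minimal Marathon app definition where you would raise the cpus and mem fields might look like the sketch below (the image name and values are illustrative; the app id is taken from the log above):

{
  "id": "dev_service",
  "instances": 1,
  "cpus": 1.0,
  "mem": 1024,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "example/dev-service"
    }
  }
}

With cgroup isolation enabled, a task that bursts above its CPU or memory allocation under load can be throttled or OOM-killed, which is why raising these values can stop the restarts.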

Related

JMeter distributed testing slave finishes before schedule

I have a JMeter distributed system with 1 master and 4 slaves.
The test is configured to run for 60 minutes.
Somehow a random slave suddenly finishes the test and the load is distributed between the other 3.
All the slaves are configured the same way.
The instances are AWS EC2 instances on the same subnet.
Is there any explanation for this behaviour?
It might be the case you configured JMeter yourself to stop threads when an error occurs:
if you have ticked the corresponding settings under the Thread Group, the threads (virtual users) may be stopped, or the whole test may be stopped, on error.
If an unexpected error occurs there should be a corresponding entry in the jmeter.log file, so make sure to execute the JMeter slave process providing the log file location via the -j command-line argument, like:
./jmeter -s -j jmeter-slave.log .....
It might be the case your JMeter instance runs out of memory and the whole JVM gets terminated, so make sure to properly tune it for high loads (see the sketch after this list).
Check the operating system log of your Amazon instance.
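Regarding the out-of-memory point above: one way to tune the slave's heap is the JVM_ARGS environment variable, which the jmeter startup script honours. A sketch with illustrative sizes (pick values that fit the machine's RAM):

# start the slave with an explicit heap and a dedicated log file
JVM_ARGS="-Xms1g -Xmx4g" ./jmeter -s -j jmeter-slave.log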
There could be multiple reasons for it:
Possibly load balancing is not happening properly and more requests are being driven toward one instance; that can cause the VM to crash.
Or it could be that the crashed AWS instance ran out of disk space.
I suggest you check the disk usage of the crashed VM.

service was unable to place a task because no container instance met requirements. The closest matching container-instance has insufficient CPU units

I am trying to run 2 tasks on the same ECS container instance. The container instance is a t2.large EC2 instance.
One of the tasks (which is a daemon) starts fine and is RUNNING.
The other task which is an application, does not start and I see the following errors in the Events tab.
service test-service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance xxxxxx has insufficient CPU units available. For more information, see the Troubleshooting section.
service test-service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance xxxxxx has insufficient memory available. For more information, see the Troubleshooting section.
I looked at the CPU and memory section for the container instance and the values are:
          Registered   Available
CPU       1024         1014
Memory    985          729
My task definition for the task that does not run has the following CPU and memory values:
"memory": 512,
"cpu": 10
The daemon that successfully runs on the same EC2 container instance also has the same values for memory and CPU.
I read through the AWS docs at https://aws.amazon.com/premiumsupport/knowledge-center/ecs-container-instance-requirement-error/ and tried to reduce the CPU and memory requirements for the test-service task definition, but nothing helped. I also changed the instance type to something bigger, but that did not help either.
Can someone please help me with what CPU and memory values I should set for both tasks (daemon and application) so they can run on the same EC2 container instance?
Note: I cannot add another container to the ECS cluster.
Your task definition sets the CPU limit to 10 units, which is probably insufficient in your case. ECS can manage CPU resources dynamically when you set the value to 0; however, that is not possible for memory.
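A minimal sketch of the relevant containerDefinitions fragment under this approach (the container name and image are illustrative; "cpu": 0 leaves CPU scheduling to ECS, and memoryReservation is the soft memory limit you can use instead of the hard memory limit):

"containerDefinitions": [
  {
    "name": "test-service",
    "image": "example/test-service",
    "cpu": 0,
    "memoryReservation": 256
  }
]

Note that the soft limit still counts against the instance's registered memory when ECS places the task, so the reservations of both tasks together must fit into the available 729 MB shown above.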

How to find Jenkins build jobs consuming high CPU and Memory

Our Jenkins service has been hanging every day; we cannot access the web page, and I have to restart the Jenkins service after it hangs.
In Jenkins I have set only 4 executors. Sometimes, when 4 build jobs are running, the CPU load is almost 90%, which I guess is the reason why Jenkins dies.
So how can I find which build jobs consume high CPU, and how can I find the root cause of why Jenkins dies? I checked the system log but did not find any useful info.
I'm running Jenkins version 2.150.1 in Ubuntu 16.04.
Thanks.
We were also facing the same issues. We set up Prometheus + Grafana monitoring for Jenkins.
This video is helpful - https://www.youtube.com/watch?v=3H9eNIf9KZs
With this we were able to pinpoint the time when CPU was hitting its peak, and then we were able to find the job causing the CPU consumption. We have since switched to ECS (AWS Cloud) based job running, i.e. CPU and memory are allocated separately to each job.
We also added a Jenkins observer script on the machine where Jenkins is hosted: it checks whether Jenkins is down (i.e. returning 502) and, if so, restarts it from a crontab script (a sketch is below).
These things helped us solve the problems related to Jenkins downtime.
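A minimal sketch of such an observer script, assuming Jenkins runs as a systemd service named jenkins and serves HTTP on port 8080 (adjust the URL and service name to your setup):

#!/bin/sh
# restart Jenkins when the web UI returns 502 (or is unreachable at all)
STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/login)
if [ "$STATUS" = "502" ] || [ "$STATUS" = "000" ]; then
    systemctl restart jenkins
fi

Scheduled from cron, e.g. every 5 minutes:

*/5 * * * * /usr/local/bin/jenkins-observer.sh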

Apache Flink on Windows

First, I am a complete newbie with Flink. I have installed Apache Flink on Windows.
I start Flink with start-cluster.bat. It prints out
Starting a local cluster with one JobManager process and one
TaskManager process. You can terminate the processes via CTRL-C in the
spawned shell windows. Web interface by default on
http://localhost:8081/.
Anyway, when I submit the job, I get a bunch of messages:
DEBUG org.apache.flink.runtime.rest.RestClient - Received response
{"status":{"id":"IN_PROGRESS"}}.
In the log in the web UI at http://localhost:8081/, I see:
2019-02-15 16:04:23.571 [flink-akka.actor.default-dispatcher-4] WARN
akka.remote.ReliableDeliverySupervisor
flink-akka.remote.default-remote-dispatcher-6 - Association with
remote system [akka.tcp://flink@127.0.0.1:56404] has failed, address
is now gated for [50] ms. Reason: [Disassociated]
If I go to the Task Manager tab, it is empty.
I tried to find out whether any port needed by Flink was already in use, but that does not seem to be the case.
Any idea to solve this?
I was running Flink locally using IntelliJ, using the Maven archetype that gives you ready-to-go examples:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/projectsetup/java_api_quickstart.html
You do not necessarily have to install Flink unless you are running it as a service on a cluster.
The IDE will compile and run your code just fine against a throwaway local Flink instance for a single run.
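For reference, the quickstart page linked above generates such a ready-to-run project via the Flink Maven archetype; the version below is just an example, so use the one matching your Flink docs:

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.7.2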

I am not sure whether the application is running on just the master or the whole cluster for Spark on EC2

I am using Spark 1.1.1 . I followed the instructions given on https://spark.apache.org/docs/1.1.1/ec2-scripts.html and have a cluster of 1 master node and 1 worker on EC2 running.
I have made a jar of the application and rsynced it to the slaves. When I run the application using spark-submit with deploy-mode client, the application works. However, when I do so with deploy-mode cluster, it gives me an error saying it cannot find the jar on the worker. The permissions on the jar are 755 on both the master and the worker.
I am not sure whether the application is actually using the workers when I run it with deploy-mode client. I don't think it is, since the worker UI does not show any completed jobs. But it does show failed jobs for deploy-mode cluster.
Am I doing something wrong? Thank you for your help.
You can check if executors are assigned to the application on the /executors page on port 4040 (e.g. http://localhost:4040/executors/). If you only see <driver> then you are not using the worker. If you see one line for <driver> and one other line (with ID 0, unless it has restarted), then the worker is also providing an executor to your application. Here you can also see how many tasks it has completed for your application, and other stats.
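For completeness, a client-mode submission against a standalone master looks like the sketch below (the master host, class name, and jar path are placeholders); you can open the /executors page while it runs to see whether the worker contributes an executor:

# master host, main class, and jar path are placeholders
./bin/spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode client \
  --class com.example.MyApp \
  /path/to/app.jar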
