How to prevent TeamCity from spinning up an AWS cloud agent instance - amazon-ec2

When using TeamCity with the AWS Cloud Profile, it spins up the agent every 10 minutes for no reason.
I am using the agent to build a Yocto distribution, which takes a lot of time and resources. For that reason the agent is running on a large (expensive) instance which I only want to start when needed.
What I am trying to accomplish:
Have a cheap AWS instance running TeamCity server, always on.
Have a big, expensive AWS instance running a TeamCity Agent, but only turned on when needed.
Only spin up the agent machine when there is a build to run.
Place a tag in git when the build passes.
Preferably: Check for new git commits on the server, and start the agent when there is a commit.
Please note that I want to do tagging on the Agent, so server-side checkout will not work (according to the TeamCity documentation).
I created a setup in AWS where I have a small instance running TeamCity server, and a large (expensive) instance running the agent. In TeamCity, I configured a cloud profile in the Root project to automatically start the machine running the agent, and stop it when the build is finished or the machine has been idle for 5 minutes. Starting and stopping the AWS machine works, and triggering a build also works.
I created a project with a build configuration without any Triggers added to it. I also tried to increase the minimum checking interval for the VCS root for our git repository to 90000000 seconds (so that it will practically never spin up the agent machine).
For some reason, TeamCity spins up the AWS instance running the Agent every 10 minutes and it is not clear to me why it is doing this. The TeamCity log shows:
[2019-03-19 10:51:15,894] INFO - ldServer.AGENT.PollingProtocol - Polling Agent session 'f1nqO09...' was created, agent: "yocto-build-agent1" {id=3}
[2019-03-19 10:56:33,250] INFO - ldServer.AGENT.PollingProtocol - Polling Agent session 'f1nqO09...' was closed, agent: "yocto-build-agent1" {id=3}
[2019-03-19 10:56:33,257] INFO - jetbrains.buildServer.CLEANUP - Removing agent Unregistered agent "yocto-build-agent1" {id=3}
[2019-03-19 10:56:33,262] INFO - jetbrains.buildServer.AGENT - Agent "yocto-build-agent1" {id=3} has been unregistered. Reason: Agent has unregistered
[2019-03-19 11:02:30,612] INFO - jetbrains.buildServer.AGENT - Agent has been registered: "yocto-build-agent1" {id=4, protocol=unidirectional, host=10.55.20.127:9090, agentTypeId=3, pool=Default, registered since 2019-03-19 11:02:30.590}, not running a build
[2019-03-19 11:02:30,615] INFO - ldServer.AGENT.PollingProtocol - Polling Agent session 'FcY5MVV...' was created, agent: "yocto-build-agent1" {id=4}
[2019-03-19 11:02:30,666] WARN - jetbrains.buildServer.AGENT - Agent "yocto-build-agent1" {id=4} is unauthorized on registration
[2019-03-19 11:07:09,821] INFO - ldServer.AGENT.PollingProtocol - Polling Agent session 'FcY5MVV...' was closed, agent: "yocto-build-agent1" {id=4}
[2019-03-19 11:07:09,823] INFO - jetbrains.buildServer.AGENT - Agent "yocto-build-agent1" {id=4} has been unregistered. Reason: Agent has unregistered
[2019-03-19 11:07:22,455] INFO - jetbrains.buildServer.AGENT - Agent has been registered: "yocto-build-agent1" {id=4, protocol=unidirectional, host=10.55.20.127:9090, agentTypeId=3, pool=Default
The teamcity-clouds.log shows:
[2019-03-19 10:50:38,910] INFO [uled executor 1] - .instances.StoppedInstanceTask - Instance has changed status from stopped to Starting: Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Starting}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}
[2019-03-19 10:50:43,910] INFO [uled executor 5] - .server.impl.CloudEventsLogger - Cloud instance entered 'starting' state, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Starting}
[2019-03-19 10:51:15,883] INFO [nio-8111-exec-2] - r.impl.DBCloudStateManagerImpl - Image: AmazonImageInstance{id=i-071eff7574b93d12f, amazonId=i-071eff7574b93d12f}, profile: profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root} was marked to CONTAIN agent
[2019-03-19 10:51:15,883] INFO [nio-8111-exec-2] - .server.impl.CloudEventsLogger - Detected cloud agent "yocto-build-agent1" {id=3}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Starting}
[2019-03-19 10:51:43,912] INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud instance entered 'running' state, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Running}
[2019-03-19 10:56:32,807] INFO [uled executor 2] - te.IdleTimeoutTerminateFactory - Will stop instance Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Running}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root} with idle time 5 minutes (idle time is 5 minutes)
[2019-03-19 10:56:32,808] INFO [uled executor 2] - l.instances.StopInstanceAction - Terminating instance: Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Running}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, reason: "After 5 minutes idle"
[2019-03-19 10:56:32,808] INFO [uled executor 2] - r.impl.DBCloudStateManagerImpl - Image: AmazonImageInstance{id=i-071eff7574b93d12f, amazonId=i-071eff7574b93d12f}, profile: profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root} was marked to CONTAIN agent
[2019-03-19 10:56:33,004] INFO [2 Stop Instance] - r.impl.DBCloudStateManagerImpl - Image: i-071eff7574b93d12f, Instance: i-071eff7574b93d12f, profile=amazon-2 is marked with state: stopped.
[2019-03-19 10:56:33,253] INFO [nio-8111-exec-7] - ces.terminate.AgentRemoverImpl - Removing cloud agent "yocto-build-agent1" {id=3} because: Instance has been terminated. Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Running}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}
[2019-03-19 10:56:33,260] INFO [nio-8111-exec-7] - .server.impl.CloudEventsLogger - Cloud agent Unregistered agent "yocto-build-agent1" {id=3} is no longer recognized as cloud agent because instance has gone: profileId=amazon-2, imageId=i-071eff7574b93d12f, instanceId=i-071eff7574b93d12f
[2019-03-19 10:56:33,260] INFO [nio-8111-exec-7] - ces.terminate.AgentRemoverImpl - Removing cloud agent Unregistered agent "yocto-build-agent1" {id=3} because: Cloud instance (id=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, profileId=amazon-2) has gone (is not reported by cloud profile)
[2019-03-19 10:56:33,922] INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud instance entered 'scheduled to stop' state, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Scheduled to stop}
[2019-03-19 10:56:43,923] INFO [uled executor 5] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopping' state, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Stopping}
[2019-03-19 10:57:43,925] INFO [uled executor 1] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopped' state, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Stopped}
[2019-03-19 11:02:30,613] WARN [io-8111-exec-11] - .instances.StoppedInstanceTask - Agent "yocto-build-agent1" {id=4} was not matched to stopped instance Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Stopped}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}
[2019-03-19 11:02:32,705] WARN [nio-8111-exec-4] - .instances.StoppedInstanceTask - Agent "yocto-build-agent1" {id=4} was not matched to stopped instance Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Stopped}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}
[2019-03-19 11:02:43,939] INFO [uled executor 4] - .instances.StoppedInstanceTask - Instance has changed status from stopped to Running: Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Running}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}
[2019-03-19 11:02:48,939] INFO [uled executor 3] - .server.impl.CloudEventsLogger - Cloud instance entered 'running' state, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Running}
[2019-03-19 11:07:22,457] INFO [nio-8111-exec-1] - r.impl.DBCloudStateManagerImpl - Image: AmazonImageInstance{id=i-071eff7574b93d12f, amazonId=i-071eff7574b93d12f}, profile: profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root} was marked to CONTAIN agent
[2019-03-19 11:07:22,458] INFO [nio-8111-exec-1] - .server.impl.CloudEventsLogger - Detected cloud agent "yocto-build-agent1" {id=4}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Running}
[2019-03-19 11:12:32,818] INFO [uled executor 5] - te.IdleTimeoutTerminateFactory - Will stop instance Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Running}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root} with idle time 5 minutes (idle time is 5 minutes)
[2019-03-19 11:12:32,818] INFO [uled executor 5] - l.instances.StopInstanceAction - Terminating instance: Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Running}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, reason: "After 5 minutes idle"
[2019-03-19 11:12:32,818] INFO [uled executor 5] - r.impl.DBCloudStateManagerImpl - Image: AmazonImageInstance{id=i-071eff7574b93d12f, amazonId=i-071eff7574b93d12f}, profile: profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root} was marked to CONTAIN agent
[2019-03-19 11:12:33,050] INFO [2 Stop Instance] - r.impl.DBCloudStateManagerImpl - Image: i-071eff7574b93d12f, Instance: i-071eff7574b93d12f, profile=amazon-2 is marked with state: stopped.
[2019-03-19 11:12:33,403] INFO [nio-8111-exec-6] - ces.terminate.AgentRemoverImpl - Removing cloud agent "yocto-build-agent1" {id=4} because: Instance has been terminated. Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Running}, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}
[2019-03-19 11:12:33,415] INFO [nio-8111-exec-6] - .server.impl.CloudEventsLogger - Cloud agent Unregistered agent "yocto-build-agent1" {id=4} is no longer recognized as cloud agent because instance has gone: profileId=amazon-2, imageId=i-071eff7574b93d12f, instanceId=i-071eff7574b93d12f
[2019-03-19 11:12:33,415] INFO [nio-8111-exec-6] - ces.terminate.AgentRemoverImpl - Removing cloud agent Unregistered agent "yocto-build-agent1" {id=4} because: Cloud instance (id=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, profileId=amazon-2) has gone (is not reported by cloud profile)
[2019-03-19 11:12:33,960] INFO [uled executor 5] - .server.impl.CloudEventsLogger - Cloud instance entered 'scheduled to stop' state, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Scheduled to stop}
[2019-03-19 11:12:43,960] INFO [uled executor 5] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopping' state, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Stopping}
[2019-03-19 11:15:43,967] INFO [uled executor 3] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopped' state, profile 'AWS Cloud Profile'{id=amazon-2, projectId=_Root}, Amazon Instance{instanceId=i-071eff7574b93d12f, imageId=i-071eff7574b93d12f, status: Stopped}
(instance id and agent name have been changed to protect the innocent)
How can I prevent TeamCity Server from spinning up the agent machine, even if there are no triggers configured?
What is the best way to start the agent machine when there is a git commit? I was thinking of a separate build on a smaller agent which only checks commits, and make the larger build depend on that smaller build.

This is not normal behavior. The logic is: TeamCity starts a cloud build agent if there are no on-premise agents available, then terminates it when it is idle, depending on the options you provide.
I created several cloud profiles in the Root project. I use the options below:
Additional terminate conditions / If idle, stop 3 minutes before the full hour
Maximum instances count: x
Not sure if that's helpful but I don't have any "was not matched to stopped instance" entries in the teamcity-clouds.log, which I can see in your log.
Also - did you try enabling debug log level for cloud profiles? You can do it in Administration / Diagnostics (/admin/admin.html?item=diagnostics). Then check log files again.

The logging in TeamCity says "State changed from Stopped to Starting", which made me think it was TeamCity that started this agent. As it turned out, the AWS Instance which is running the build agent was started by OpsWorks.
Amazon OpsWorks was configured to monitor the state of the machine and keep it "up". So when TeamCity correctly shuts the agent machine down, Amazon OpsWorks detects the machine going down and tries to start it up again. After a while TeamCity detects it being "up" and "idle" and stops it again. This goes on indefinitely.
It would be nice to have an OpsWorks feature in the TeamCity Cloud configuration in addition to the EC2 and VMware features. That way TeamCity could tell OpsWorks that the machine is supposed to go down.
For now, we have fixed the problem by disabling OpsWorks health monitoring for the agents.
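One way to spot this kind of externally triggered start is to compare the two log shapes that appear on this page. When TeamCity itself starts an instance, teamcity-clouds.log contains a StartInstanceAction line before the status change; in the log from the question, the "has changed status from stopped to" lines appear with no such line before them. The sketch below is a heuristic built only on that observation. The function name `external_starts` is made up for illustration, and the assumption that a missing StartInstanceAction always means an outside start is not confirmed by the TeamCity documentation.

```python
import re

# Matches the leading "[2019-03-19 10:50:38,910]" timestamp of a log line.
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")


def external_starts(lines):
    """Return timestamps of 'stopped -> Starting/Running' transitions that
    were not preceded by a StartInstanceAction line since the previous
    transition (heuristic: something other than TeamCity started it)."""
    flagged = []
    teamcity_requested = False
    for line in lines:
        if "StartInstanceAction" in line:
            # TeamCity logged that it is starting an instance itself.
            teamcity_requested = True
        elif "has changed status from stopped to" in line:
            match = TS.match(line)
            if not teamcity_requested and match:
                flagged.append(match.group(1))
            # Reset so the next transition needs its own start request.
            teamcity_requested = False
    return flagged
```

To try it, pass the open log file, for example `external_starts(open("teamcity-clouds.log"))`, and check whether the flagged timestamps line up with activity from another service such as OpsWorks.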

Related

Terracotta server stuck at DIAGNOSTIC state

I am trying to run Terracotta server 10.11 from https://www.terracotta.org/downloads/ to connect it with Ehcache in my Spring Boot application. The problem is that when I run a Terracotta server instance using server\bin\start-tc-server.bat, it does not give any error but puts the server in the DIAGNOSTIC state, as shown in the logs below.
2022-11-24 14:22:01,660 INFO - Terracotta 5.8.5, as of 2021-12-16 at 22:21:34 UTC (Revision 3695ab2f870d94491c564e87c266555a7d1c096b from UNKNOWN)
2022-11-24 14:22:01,660 INFO - Extensions:
2022-11-24 14:22:01,679 INFO - PID is 23344
2022-11-24 14:22:01,965 INFO - Did not find configuration directory at: C:\Users\user\terracotta\config
2022-11-24 14:22:01,965 INFO - Starting node from config file: C:\Users\user\Downloads\ehcache-clustered-3.10.0-kit\ehcache-clustered-3.10.0-kit\server\conf\cluster.cfg
2022-11-24 14:22:02,266 INFO - Found only one node information in config file: C:\Users\user\Downloads\ehcache-clustered-3.10.0-kit\ehcache-clustered-3.10.0-kit\server\conf\cluster.cfg
2022-11-24 14:22:02,267 INFO - Starting unconfigured node: default-node
2022-11-24 14:22:02,271 INFO - Bootstrapped nomad system with root: C:\Users\user\terracotta\config
2022-11-24 14:22:02,277 INFO - Startup configuration of the node:
client-lease-duration=150s
client-reconnect-window=120s
cluster-name=clustered
failover-priority=availability
offheap-resources=main\:512MB
stripe.1.node.1.bind-address=0.0.0.0
stripe.1.node.1.group-bind-address=0.0.0.0
stripe.1.node.1.group-port=9430
stripe.1.node.1.hostname=localhost
stripe.1.node.1.log-dir=%H/terracotta/logs
stripe.1.node.1.name=default-node
stripe.1.node.1.port=9410
stripe.1.stripe-name=default-stripe
2022-11-24 14:22:02,280 INFO - Logging directory is not set. Logging only to the console
2022-11-24 14:22:02,293 INFO - Available Max Runtime Memory: 1820MB
2022-11-24 14:22:02,314 INFO - Creating server nodeID: NodeID[localhost:9410]
2022-11-24 14:22:02,546 INFO - Initializing LeaseServiceProvider with default lease length of 150000 ms
2022-11-24 14:22:02,548 INFO - Initializing org.terracotta.lease.service.LeaseServiceProvider#4cf92ef3
2022-11-24 14:22:02,549 INFO - Initializing org.terracotta.client.message.tracker.OOOMessageHandlerProvider#40f5b3f9
2022-11-24 14:22:02,556 INFO - Registered MBean with name: DiagnosticRequestHandler
2022-11-24 14:22:02,557 INFO - Registered Diagnostic Service: org.terracotta.nomad.server.NomadServer
2022-11-24 14:22:02,557 INFO - Registered Diagnostic Service: org.terracotta.dynamic_config.api.service.DynamicConfigService
2022-11-24 14:22:02,558 INFO - Registered Diagnostic Service: org.terracotta.dynamic_config.api.service.TopologyService
2022-11-24 14:22:02,558 INFO - Initializing org.terracotta.diagnostic.server.DiagnosticServiceProvider#1bf35727
2022-11-24 14:22:02,561 INFO - Initializing org.terracotta.diagnostic.server.extensions.DiagnosticExtensionsServiceProvider#410ee45a
2022-11-24 14:22:02,804 INFO - Initializing org.terracotta.management.service.monitoring.MonitoringServiceProvider#65c7455b
2022-11-24 14:22:02,804 INFO - Initializing org.terracotta.platform.ServerInfoProvider#240d561b
2022-11-24 14:22:02,806 INFO - Registered dynamic configuration change handler for setting client-reconnect-window: org.terracotta.dynamic_config.server.service.handler.ClientReconnectWindowConfigChangeHandler#74d20602
2022-11-24 14:22:02,809 INFO - Registered dynamic configuration change handler for setting log-dir: org.terracotta.dynamic_config.server.service.handler.NodeLogDirChangeHandler#67c6fc00
2022-11-24 14:22:02,810 INFO - Registered dynamic configuration change handler for setting failover-priority: ConfigChangeHandler#accept()
2022-11-24 14:22:02,810 INFO - Registered dynamic configuration change handler for setting public-hostname: ConfigChangeHandler#accept()
2022-11-24 14:22:02,811 INFO - Registered dynamic configuration change handler for setting public-port: ConfigChangeHandler#accept()
2022-11-24 14:22:02,811 INFO - Registered dynamic configuration change handler for setting cluster-name: ConfigChangeHandler#accept()
2022-11-24 14:22:02,812 INFO - Registered dynamic configuration change handler for setting lock-context: ConfigChangeHandler#accept()
2022-11-24 14:22:02,812 INFO - Registered dynamic configuration change handler for setting logger-overrides: org.terracotta.dynamic_config.server.service.handler.LoggerOverrideConfigChangeHandler#3ba87843
2022-11-24 14:22:02,813 INFO - Registered dynamic configuration change handler for setting tc-properties: org.terracotta.dynamic_config.server.api.SelectingConfigChangeHandler#16df9889
2022-11-24 14:22:02,815 INFO - Initializing org.terracotta.dynamic_config.server.service.DynamicConfigServiceProvider#29ca0612
2022-11-24 14:22:02,815 INFO - Registering implementation-provided service com.tc.services.PlatformServiceProvider#16b645b2
2022-11-24 14:22:02,816 INFO - Registering implementation-provided service com.tc.services.EntityMessengerProvider#3c352805
2022-11-24 14:22:02,816 INFO - Initializing com.tc.objectserver.persistence.NullPlatformStorageServiceProvider#149f57c4
2022-11-24 14:22:02,818 INFO - Registering implementation-provided service com.tc.services.LocalMonitoringProducer#5baa3715
2022-11-24 14:22:02,830 INFO - Creating 4 worker comm threads for default-node - L2_L1
2022-11-24 14:22:02,910 INFO - Registering implementation-provided service com.tc.services.CommunicatorService#7d51aa32
2022-11-24 14:22:02,920 INFO - HealthChecker Started
2022-11-24 14:22:02,952 INFO - Started the server in diagnostic mode
2022-11-24 14:22:02,967 INFO - Server started as default-node
2022-11-24 14:22:02,959 INFO - Terracotta Server instance has started diagnostic listening on all interfaces (address:/0.0.0.0 port:9410)
2022-11-24 14:22:03,177 INFO - Moved to State[ DIAGNOSTIC ]
According to the documentation it should be in the ACTIVE state to be running properly. I still tried to connect to the server from my Spring Boot application, but it was unable to reach it and threw a TimeoutException.
I am using the following command to run the server instance:
./start-tc-server.bat -f C:\Users\user\Downloads\ehcache-clustered-3.10.0-kit\ehcache-clustered-3.10.0-kit\server\conf\cluster.cfg
Does anyone have any clue why it's not getting to the ACTIVE state? Maybe try to run it on your end and see if the server gets to the ACTIVE state. Or is there anything I am missing?
Thanks in Advance.
P.S. I tried running an older version of Terracotta server from the same downloads page and it easily goes to the ACTIVE state, but I cannot use the old version since it is not compatible with Ehcache 3.x.

You need to activate the server.
In the kit that you downloaded, navigate to /tools/bin and run config-tool.bat activate -f ../../server/conf/cluster.cfg
This will create a folder C:\Users\{user}\terracotta that will contain the configs and logs for the Terracotta server, so the next time you start it, it will use the configs in that folder and automatically go to the ACTIVE state.
If you need to change the configs, delete the folder, restart Terracotta, and activate it again using the config tool.
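To confirm whether activation took effect, it may help to look for the last "Moved to State[ ... ]" line in the server output. The sketch below assumes the line format shown in the question's log. The helper name `last_state` is hypothetical, and which state name is printed after a successful activation is something to verify against your own output.

```python
import re

# Pattern taken from the log line "Moved to State[ DIAGNOSTIC ]" in the question.
STATE = re.compile(r"Moved to State\[\s*([A-Z_-]+)\s*\]")


def last_state(log_lines):
    """Return the state named in the last 'Moved to State[ ... ]' line, or None."""
    state = None
    for line in log_lines:
        match = STATE.search(line)
        if match:
            state = match.group(1)
    return state
```

If the function still returns DIAGNOSTIC after running the config tool, the activation step probably did not reach this node.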

The oozie job does not run with the message [AM container is launched, waiting for AM container to Register with RM]

I ran the shell job from the Oozie examples.
However, the YARN application never actually runs.
Detailed information from the YARN UI and logs:
https://docs.google.com/document/d/1N8LBXZGttY3rhRTwv8cUEfK3WkWtvWJ-YV1q_fh_kks/edit
YARN application status is
Application Priority: 0 (Higher Integer value indicates higher priority)
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
Queue: default
FinalStatus Reported by AM: Application has not completed yet.
Finished: N/A
Elapsed: 20mins, 30sec
Tracking URL: ApplicationMaster
Log Aggregation Status: DISABLED
Application Timeout (Remaining Time): Unlimited
Diagnostics: AM container is launched, waiting for AM container to Register with RM
Application Attempt status is
Application Attempt State: FAILED
Elapsed: 13mins, 19sec
AM Container: container_1607273090037_0001_02_000001
Node: N/A
Tracking URL: History
Diagnostics Info: ApplicationMaster for attempt appattempt_1607273090037_0001_000002 timed out
                                         | Node Local Request | Rack Local Request | Off Switch Request
Num Node Local Containers (satisfied by) | 0                  |                    |
Num Rack Local Containers (satisfied by) | 0                  | 0                  |
Num Off Switch Containers (satisfied by) | 0                  | 0                  | 1
nodemanager log
2020-12-07 01:45:16,237 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_1607273090037_0001_01_000001]
2020-12-07 01:45:16,267 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1607273090037_0001_01_000001 transitioned from SCHEDULED to RUNNING
2020-12-07 01:45:16,267 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1607273090037_0001_01_000001
2020-12-07 01:45:16,272 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /tmp/hadoop-oozie/nm-local-dir/usercache/oozie/appcache/application_1607273090037_0001/container_1607273090037_0001_01_000001/default_container_executor.sh]
2020-12-07 01:45:17,301 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1607273090037_0001_01_000001's ip = 127.0.0.1, and hostname = localhost.localdomain
2020-12-07 01:45:17,345 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1607273090037_0001_01_000001 since CPU usage is not yet available.
2020-12-07 01:45:48,274 INFO logs: Aliases are enabled
2020-12-07 01:54:50,242 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 496756, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
2020-12-07 01:58:10,071 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1607273090037_0001_000001 (auth:SIMPLE)
2020-12-07 01:58:10,078 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1607273090037_0001_01_000001
What is the problem?

Change Mesos Master Leader, cause Marathon shutdown?

Env:
Zookeeper on computer A,
Mesos master on computer B as Leader,
Mesos master on computer C,
Marathon on computer B singleton.
Action:
Kill the Mesos master task on computer B, to force a change of the Mesos cluster leader.
Result:
The Mesos cluster leader changes to the Mesos master on computer C,
but the Marathon task on computer B shuts down automatically with the following logs.
Question:
Can somebody help me understand why Marathon goes down, and how to fix it?
Logs:
I1109 12:19:10.010197 11287 detector.cpp:152] Detected a new leader: (id='9')
I1109 12:19:10.010646 11291 group.cpp:699] Trying to get '/mesos/json.info_0000000009' in ZooKeeper
I1109 12:19:10.013425 11292 zookeeper.cpp:262] A new leading master (UPID=master@10.4.23.55:5050) is detected
[2017-11-09 12:19:10,015] WARN Disconnected (mesosphere.marathon.MarathonScheduler:Thread-23)
I1109 12:19:10.018977 11292 sched.cpp:2021] Asked to stop the driver
I1109 12:19:10.019161 11292 sched.cpp:336] New master detected at master@10.4.23.55:5050
I1109 12:19:10.019892 11292 sched.cpp:1203] Stopping framework d52cbd8c-1015-4d94-8328-e418876ca5b2-0000
[2017-11-09 12:19:10,020] INFO Driver future completed with result=Success(()). (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO Abdicating leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,022] INFO Stopping the election service (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,029] INFO backgroundOperationsLoop exiting (org.apache.curator.framework.imps.CuratorFrameworkImpl:Curator-Framework-0)
[2017-11-09 12:19:10,061] INFO Session: 0x15f710ffb010058 closed (org.apache.zookeeper.ZooKeeper:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,061] INFO EventThread shut down for session: 0x15f710ffb010058 (org.apache.zookeeper.ClientCnxn:pool-3-thread-1-EventThread)
[2017-11-09 12:19:10,063] INFO Stopping MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,063] INFO Lost leadership (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,066] INFO All actors suspended:
* Actor[akka://marathon/user/offerMatcherStatistics#-1904211014]
* Actor[akka://marathon/user/reviveOffersWhenWanted#-238627718]
* Actor[akka://marathon/user/expungeOverdueLostTasks#608979053]
* Actor[akka://marathon/user/launchQueue#803590575]
* Actor[akka://marathon/user/offersWantedForReconciliation#598482724]
* Actor[akka://marathon/user/offerMatcherLaunchTokens#813230776]
* Actor[akka://marathon/user/offerMatcherManager#1205401692]
* Actor[akka://marathon/user/instanceTracker#1055980147]
* Actor[akka://marathon/user/killOverdueStagedTasks#-40058350]
* Actor[akka://marathon/user/taskKillServiceActor#-602552505]
* Actor[akka://marathon/user/rateLimiter#-911383474]
* Actor[akka://marathon/user/deploymentManager#2013376325] (mesosphere.marathon.core.leadership.impl.LeadershipCoordinatorActor:marathon-akka.actor.default-dispatcher-10)
I1109 12:19:10.069551 11272 sched.cpp:2021] Asked to stop the driver
[2017-11-09 12:19:10,068] INFO Stopping driver (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,069] INFO Stopped MarathonSchedulerService [RUNNING]'s leadership (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,070] INFO Terminating due to leadership abdication or failure (mesosphere.marathon.core.election.impl.CuratorElectionService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,071] INFO Call postDriverRuns callbacks on (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,074] INFO Now standing by. Closing existing handles and rejecting new. (mesosphere.marathon.core.event.impl.stream.HttpEventStreamActor:marathon-akka.actor.default-dispatcher-12)
[2017-11-09 12:19:10,074] INFO Suspending scheduler actor (mesosphere.marathon.MarathonSchedulerActor:marathon-akka.actor.default-dispatcher-2)
[2017-11-09 12:19:10,083] INFO Finished postDriverRuns callbacks (mesosphere.marathon.MarathonSchedulerService:ForkJoinPool-3-worker-5)
[2017-11-09 12:19:10,084] INFO ExpungeOverdueLostTasksActor has stopped (mesosphere.marathon.core.task.jobs.impl.ExpungeOverdueLostTasksActor:marathon-akka.actor.default-dispatcher-9)
[1]+ Exit 137

I think the ZooKeeper cluster is misconfigured. Use a 3-node ZooKeeper cluster, 2 Mesos masters, and multiple slaves. Ref: https://www.google.co.in/amp/s/beingasysadmin.wordpress.com/2014/08/16/managing-ha-docker-cluster-using-multiple-mesos-masters/amp/
Did you set the master reference in the Marathon conf?
Can you run:
cat /etc/marathon/conf/master

How can I compile my C++ code with Mesos?

I have a cluster of 1 master node and 2 slaves, and I'm trying to compile my application with Mesos.
Basically, here is the command that I use:
mesos-execute --name=alc1 --command="ccmake -j myapp" --master=10.11.12.13:5050
Offers are made from the slave but this compilation task keeps failing.
[root@master-node ~]# mesos-execute --name=alc1 --command="ccmake -j myapp" --master=10.11.12.13:5050
I0511 22:26:11.623016 11560 sched.cpp:222] Version: 0.28.0
I0511 22:26:11.625602 11564 sched.cpp:326] New master detected at master@10.11.12.13:5050
I0511 22:26:11.625952 11564 sched.cpp:336] No credentials provided. Attempting to register without authentication
I0511 22:26:11.627279 11564 sched.cpp:703] Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0139
Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0139
task alc1 submitted to slave 70582e35-5d6e-4915-a919-cae61c904fd9-S2
Received status update TASK_RUNNING for task alc1
Received status update TASK_FAILED for task alc1
I0511 22:26:11.759610 11567 sched.cpp:1903] Asked to stop the driver
I0511 22:26:11.759639 11567 sched.cpp:1143] Stopping framework '70582e35-5d6e-4915-a919-cae61c904fd9-0139'
In the sandbox on the slave node, the stderr log shows:
I0511 22:26:13.781070 5037 exec.cpp:143] Version: 0.28.0
I0511 22:26:13.785001 5040 exec.cpp:217] Executor registered on slave 70582e35-5d6e-4915-a919-cae61c904fd9-S2
sh: ccmake: command not found
I0511 22:26:13.892653 5042 exec.cpp:390] Executor asked to shutdown
Note that commands like this one work fine and give me the expected results:
[root@master-node ~]# mesos-execute --name=alc1 --command="find / -name a" --master=10.11.12.13:5050
I0511 22:26:03.733172 11550 sched.cpp:222] Version: 0.28.0
I0511 22:26:03.736112 11554 sched.cpp:326] New master detected at master@10.11.12.13:5050
I0511 22:26:03.736383 11554 sched.cpp:336] No credentials provided. Attempting to register without authentication
I0511 22:26:03.737730 11554 sched.cpp:703] Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0138
Framework registered with 70582e35-5d6e-4915-a919-cae61c904fd9-0138
task alc1 submitted to slave 70582e35-5d6e-4915-a919-cae61c904fd9-S2
Received status update TASK_RUNNING for task alc1
Received status update TASK_FINISHED for task alc1
I0511 22:26:04.184813 11553 sched.cpp:1903] Asked to stop the driver
I0511 22:26:04.184844 11553 sched.cpp:1143] Stopping framework '70582e35-5d6e-4915-a919-cae61c904fd9-0138'
I don't really understand what I need to do to even troubleshoot this issue.
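The stderr line `sh: ccmake: command not found` suggests that the shell on the slave that ran the task could not find the binary on its PATH, while `find` was available there. As a rough first check, a sketch like the following can be run on each slave to list which tools are missing. The helper name `missing_commands` and the extra tool names in the usage note are illustrative assumptions, not part of the original setup.

```python
import shutil


def missing_commands(commands):
    """Return the commands that cannot be found on this machine's PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]
```

For example, `print(missing_commands(["ccmake", "cmake", "make"]))` run on a slave shows which of those names its PATH cannot resolve. Keep in mind that the PATH seen by the Mesos executor may differ from the one in an interactive login shell.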

Teamcity error - Image does not exist in Amazon EC2

I am using TeamCity server 8.0.3 (build 27540) and have set up several git projects. I am using Amazon EC2 images which are set up as cloud agents.
My setup starts up the agents and runs the tests on them when a new job is put to the queue.
It worked fine for weeks, and I have not touched the server, but suddenly it stopped working.
In the cloud tabs I now get "Image error" with the message: "Image does not exist in Amazon EC2". Even though the image exists there!
If I open and save, or re-create, my cloud agent settings, TeamCity wakes one of the agents and it runs builds just fine. The warning is still there, though, and once the agent shuts down after my configured delay I have to re-save the cloud agent settings to force the agent to start again.
I have tried restarting TeamCity, to no avail.
These are the errors from teamcity-cloud.log:
[2013-12-12 06:11:00,850] WARN [rome'{id=cp3} 1] - .clouds.amazon.image.ImageBase - Failed to fetch data of image AmazonImageInstance{id=i-d123cd29, amazonId=ami-463edb31} from EC2. Image does not exist in Amazon EC2
[2013-12-12 06:11:00,958] INFO [ue; Flush Queue] - .instances.StartInstanceAction - Starting cloud instance: profile 'EC2 TC Win Chrome'{id=cp3}, AmazonImageInstance{id=i-d123cd29, amazonId=ami-463edb31}, hash=EZObYYzWwxDOZ4o9svYSvGbdLqf5e7KQ, reason=Automatic start: Compatible with queued build Booking Log Client :: Win Chrome(promotionId=12008)
[2013-12-12 06:11:01,565] INFO [ue; Flush Queue] - .server.impl.CloudEventsLogger - Cloud instance start succeeded: profile 'EC2 TC Win Chrome'{id=cp3}, Amazon Instance{instanceId=i-d123cd29, imageId=i-d123cd29, amazonImageId=ami-463edb31, status: Scheduled to start}
[2013-12-12 06:11:03,135] INFO [rome'{id=cp3} 1] - r.impl.DBCloudStateManagerImpl - Image: i-d123cd29, Instance: i-d123cd29 is marked with state: running.
[2013-12-12 06:12:18,441] WARN [uled executor 4] - .instances.StoppedInstanceTask - Instance has changed status from stopped to Running: Amazon Instance{instanceId=i-d123cd29, imageId=i-d123cd29, amazonImageId=ami-463edb31, status: Running}, profile 'EC2 TC Win Chrome'{id=cp3}
[2013-12-12 06:12:20,759] WARN [rome'{id=cp3} 1] - .clouds.amazon.image.ImageBase - Failed to fetch data of image AmazonImageInstance{id=i-d123cd29, amazonId=ami-463edb31} from EC2. Image does not exist in Amazon EC2
[2013-12-12 06:13:04,668] INFO [nio-8111-exec-9] - r.impl.DBCloudStateManagerImpl - Image: AmazonImageInstance{id=i-d123cd29, amazonId=ami-463edb31}, profile: profile 'EC2 TC Win Chrome'{id=cp3} was marked to CONTAIN agent
[2013-12-12 06:13:04,672] INFO [nio-8111-exec-9] - .server.impl.CloudEventsLogger - Detected cloud agent EC2-i-d123cd29 {id=24, host=10.254.1.23:9090, agentTypeId=122, pool=Win Chrome}, profile 'EC2 TC Win Chrome'{id=cp3}, Amazon Instance{instanceId=i-d123cd29, imageId=i-d123cd29, amazonImageId=ami-463edb31, status: Running}
And from teamcity-server.log:
[2013-12-12 06:13:04,648] WARN - jetbrains.buildServer.AGENT - Agent EC2-i-dd123cd29 is unauthorized with comment: Cloud instance (id=i-dd123cd29, imageId=i-dd123cd29, profileId=cp3) has gone (is not reported by cloud profile)
[2013-12-12 06:13:04,661] INFO - jetbrains.buildServer.SERVER - Agent has been registered: EC2-i-dd123cd29 {id=24, host=10.254.1.23:9090, agentTypeId=122, pool=Win Chrome}, not running a build
[2013-12-12 06:13:04,690] INFO - tbrains.buildServer.ACTIVITIES - Agent "EC2-i-dd123cd29 {id=24, host=10.254.1.23:9090, agentTypeId=122, pool=Win Chrome}" was authorised with comment "Virtual agent is authorized automatically."
[2013-12-12 06:13:05,022] INFO - tbrains.buildServer.ACTIVITIES - Build started; Booking Log Client :: Win Chrome {id=booking_log_client, internal id=bt18} #34 {build id=6213, promotion id=12008, branch=<default>, history=false, agent=EC2-i-dd123cd29 {id=24, host=10.254.1.23:9090, agentTypeId=122, pool=Win Chrome}, triggered by "Nisse Nilsson" (##userId='10'). Started Thu Dec 12 06:13:04 UTC 2013}
[2013-12-12 06:13:57,586] INFO - jetbrains.buildServer.SERVER - Agent description updated for agent: EC2-i-dd123cd29 {id=24, host=10.254.1.23:9090, agentTypeId=122, pool=Win Chrome}
[2013-12-12 06:23:54,323] INFO - jetbrains.buildServer.SERVER - Agent description updated for agent: EC2-i-dd123cd29 {id=24, host=10.254.1.23:9090, agentTypeId=122, pool=Win Chrome}
The error appears because the AMI 'ami-463edb31' is no longer available in EC2's 'EU (Ireland)' region.
Source: https://aws.amazon.com/amis/amazon-public-images-basic-microsoft-windows-server-2008-64-bit
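To confirm this independently of TeamCity, you can ask EC2 directly whether the AMI is still visible. The following is only a sketch: it assumes the `boto3` library and working AWS credentials, uses `eu-west-1` as the region code for EU (Ireland), and the helper names `ami_exists` and `check_with_boto3` are made up for illustration.

```python
def ami_exists(ec2_client, image_id):
    """Return True if EC2 lists the AMI for this client's account and region."""
    try:
        response = ec2_client.describe_images(ImageIds=[image_id])
    except Exception as exc:
        # botocore's ClientError exposes the AWS error code via exc.response.
        error = getattr(exc, "response", None) or {}
        code = error.get("Error", {}).get("Code", "")
        if code in ("InvalidAMIID.NotFound", "InvalidAMIID.Unavailable"):
            return False
        raise  # anything else (credentials, throttling, ...) is a real failure
    return len(response.get("Images", [])) > 0


def check_with_boto3(image_id="ami-463edb31", region="eu-west-1"):
    """Hypothetical entry point; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so ami_exists stays usable without it

    return ami_exists(boto3.client("ec2", region_name=region), image_id)
```

If `check_with_boto3()` returns `False`, the cloud profile needs to point at an AMI that still exists, for example a newer image of the same family or a private copy.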

Resources