Nomad job stuck in pending state due to failed Consul KV resolution in template - consul

Nomad v1.0.4, Consul v1.7.3
We have a Nomad job spec with several task groups. Each task group has a single task. Each task has the same template stanza that references several Consul KV paths like so:
{{ if keyExists "services/mysql/database" }}
MYSQL_DB = "{{ key "services/mysql/database" }}"
{{ end }}
The Nomad job spec is programmatically generated in JSON format and submitted to the Nomad cluster via POST /jobs. All tasks in this job are constrained to run on the same host machine.
We are seeing that some (not all) of the tasks become stuck in a pending state with allocation errors such as:
[1] Template failed: kv.block(services/mysql/database): Get "http://127.0.0.1:8500/v1/kv/services/mysql/database?index=1328&stale=&wait=60000ms": EOF
or
[2] Missing: kv.block(services/mysql/database)
Note that the specific Consul KV path indicated in the allocation error message is non-deterministic. As mentioned above, every task uses the same template stanza, which references several Consul KV paths, yet the path named in the error differs from one failed allocation to the next.
We've verified that the Consul cluster is alive and that all the KV paths referenced in the template stanza exist.
In theory, all the tasks should have met the same fate (i.e. failed) if the Consul HTTP request were bad or the Consul KV path did not exist. As mentioned, only some of the tasks failed while the others successfully entered a running state. From this we know that the template stanza itself is valid, since at least some of the tasks are able to run successfully.
We verified that the Consul HTTP request was working by running it directly via cURL.
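For reference, such a check can be done by replaying the blocking query from error [1] directly against the local Consul agent, for example (the index value comes from that error message and is otherwise arbitrary):
curl -v "http://127.0.0.1:8500/v1/kv/services/mysql/database?index=1328&stale=&wait=60000ms"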
Interestingly, some of the failed tasks recover automatically when they are later rescheduled. Others, however, simply remain in a pending state forever.
Any insights about this behavior or possible solutions to explore are greatly appreciated.

Consul limits the number of concurrent HTTP connections from a single client IP. You could check that limit.
In my Nomad/Consul deployment I had a similar issue. The first 20 tasks could start on a particular node, but the 21st failed to start because it could not read a KV entry (while the other 20 could read the same entry). It behaved very strangely. Raising the mentioned limit solved my issue.
Btw, I was skeptical about increasing that limit beyond 200; it seemed high enough. However, it appears that a single Nomad task opens multiple Consul HTTP connections, so my 20 tasks could quickly exhaust the limit of 200.
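If that limit is indeed the culprit, it can be raised in the Consul agent configuration on the affected client node. A minimal sketch, assuming the agent is configured via HCL and that the relevant knob is Consul's http_max_conns_per_client setting (default 200 on this Consul version); the value 400 here is purely illustrative:
limits {
  # Maximum concurrent HTTP(S) connections Consul accepts from a single client IP.
  # Many Nomad template watchers on one host can add up quickly.
  http_max_conns_per_client = 400
}
After raising it, restart the agent and check whether the EOF / missing-KV errors disappear.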

Related

Ansible Tower/AWX - get pending hosts

I have a playbook that runs maintenance on multiple hosts. I am using the AWX API to monitor the status of the job, and the same script will cancel the job after a given timeout.
Question: How can I get a listing of the hosts that are still running? I would like to generate an incident/alert that includes the slow servers before terminating the job.
Note: I tried using "/api/v2/jobs/{id}/job_host_summaries/", which lists the Job Host Summaries for a job. However, this is only updated once the job is complete.
Anyone have a quick way to get a list of hosts that are still running?
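For context, a minimal sketch of how that endpoint can be polled, assuming token authentication; the AWX host, token variable, and job id are placeholders:
curl -s -H "Authorization: Bearer $AWX_TOKEN" \
  "https://awx.example.com/api/v2/jobs/42/job_host_summaries/"
As noted above, though, these summaries only fill in once the job completes, which is exactly the limitation being asked about.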

Does the ECS update-service command mark the container instance state as DRAINING when used with the --force-new-deployment option?

The command:
aws ecs update-service --service my-http-service --task-definition amazon-ecs-sample --force-new-deployment
As per AWS docs: You can use this option (--force-new-deployment) to trigger a new deployment with no service definition changes. For example, you can update a service's tasks to use a newer Docker image with the same image/tag combination (my_image:latest) or to roll Fargate tasks onto a newer platform version.
My question: if I use --force-new-deployment (as I will use the existing tag or definition), will the underlying ECS instance automatically be set to the DRAINING state, so that no new task (if any) will start on the EXISTING ECS instance that is supposed to go away during the rolling-update deployment strategy (or deployment controller)?
In other words, is there any chance:
For a new task to be created on the existing/old container instance that is supposed to go away during the rolling update?
Also, what would happen to the ongoing tasks that are running on this existing/old container instance that is supposed to go away during the rolling update?
Ref: https://docs.aws.amazon.com/cli/latest/reference/ecs/update-service.html
Please note that no container instance is going anywhere with this update-service command. It only creates a new deployment under the ECS service and, once the new tasks become healthy, removes the old task(s).
Edit 1:
What about the requests that were being served by the old tasks?
I am assuming the tasks are behind an Application Load Balancer. In this case, old tasks will be deregistered from the ALB.
Note: in the following discussion, "target" refers to the ECS task.
To give you a brief description of how the deregistration delay works with ECS, the following is the sequence of events when a task connected to an ALB is stopped. The stop can be due to a scale-in event, deployment of a new task definition, a decrease in the number of tasks, a forced deployment, etc.
1. ECS sends a DeregisterTargets call and the targets change their status to "draining". New connections will not be routed to these targets.
2. If the deregistration delay elapses and there are still in-flight requests, the ALB will terminate them and clients will receive 5XX responses originating from the ALB.
3. The targets are deregistered from the target group.
4. ECS sends the stop call to the tasks and the ECS agent gracefully stops the containers (SIGTERM).
5. If the containers are not stopped within the stop timeout period (ECS_CONTAINER_STOP_TIMEOUT, 30s by default), they will be force-stopped (SIGKILL).
As per the ELB documentation [1], if a deregistering target has no in-flight requests and no active connections, Elastic Load Balancing immediately completes the deregistration process without waiting for the deregistration delay to elapse. However, even though target deregistration is complete, the target's status will still be displayed as draining, and the registered target of the old task will remain visible in the target group console with the status draining until the deregistration delay elapses.
ECS is designed to stop the task after the draining process completes, as mentioned in the ECS documentation [2]; hence the ECS service waits for the target group to finish draining before issuing the stop call.
Ref:
[1] Target Groups for Your Application Load Balancers - Deregistration Delay - https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#deregistration-delay
[2] Updating a Service - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/update-service.html
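The deregistration delay discussed above is a target group attribute, so it can be tuned if the default drain window doesn't suit your tasks. A minimal sketch using the AWS CLI; the target group ARN and the 30-second value are placeholders:
# Adjust how long the ALB keeps a deregistering target in the "draining" state
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/my-tg/abc1234567890 \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30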

Setting up a Sensu-Go cluster - cluster is not synchronizing

I'm having an issue setting up my cluster according to the documents, as seen here: https://docs.sensu.io/sensu-go/5.5/guides/clustering/
This is a non-HTTPS setup to get my feet wet; I'm not concerned with TLS at the moment. I just want a running cluster to begin with.
I've set up sensu-backend on my three nodes and configured the backend (backend.yml) accordingly on all three through an Ansible playbook. However, the nodes do not discover each other; each one simply shows the following:
For backend1:
=== Etcd Cluster ID: 3b0efc7b379f89be
ID Name Peer URLs Client URLs
────────────────── ─────────────────── ─────────────────────── ───────────────────────
8927110dc66458af backend1 http://127.0.0.1:2380 http://localhost:2379
For backend2 and backend3 it's the same, except each shows only itself as the sole member of its own cluster.
I've tried both the configuration in the docs, as well as the configuration in this git issue: https://github.com/sensu/sensu-go/issues/1890
Neither of these has panned out for me. I've ensured all the ports are open, so that's not an issue.
When I do a manual sensuctl cluster member-add X X, I get an error message and the sensu-backend process fails. I can't remove the member either, because that prevents the entire process from starting. I have to revert to an earlier snapshot to fix it.
The configs on all machines are the same, except that the IPs and names are adjusted for each machine:
etcd-advertise-client-urls: "http://XX.XX.XX.20:2379"
etcd-listen-client-urls: "http://XX.XX.XX.20:2379"
etcd-listen-peer-urls: "http://0.0.0.0:2380"
etcd-initial-cluster: "backend1=http://XX.XX.XX.20:2380,backend2=http://XX.XX.XX.31:2380,backend3=http://XX.XX.XX.32:2380"
etcd-initial-advertise-peer-urls: "http://XX.XX.XX.20:2380"
etcd-initial-cluster-state: "new" # have also tried existing
etcd-initial-cluster-token: ""
etcd-name: "backend1"
Did you find the answer to your question? I saw that you posted over on the Sensu forums as well.
In any case, the easiest thing to do here would be to stop the cluster, blow away /var/lib/sensu/sensu-backend/etcd/, and reconfigure the cluster. As it stands, the behavior you're seeing suggests the cluster members were each started individually first, which is likely what's causing the issue and is the reason for blowing the etcd directory away.
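A minimal sketch of that reset, run on each backend node, assuming the packaged systemd unit and the default state path mentioned above (adjust the service name and path to your install):
# Stop the backend, remove the stale embedded-etcd state, then start it again
sudo systemctl stop sensu-backend
sudo rm -rf /var/lib/sensu/sensu-backend/etcd/
sudo systemctl start sensu-backend
With matching etcd-initial-cluster settings in backend.yml on all three nodes, they should then bootstrap as a single cluster on the first start.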

Mesos task history after restart

I am using Mesos for container orchestration and get the task history from Mesos using the /tasks endpoint.
Mesos is running in a 7-node cluster and ZooKeeper is running in a 3-node cluster. I assume Mesos uses ZooKeeper to store the task history. We sometimes lose history when we restart Mesos. Does it store the history in memory? I am trying to understand what is happening here.
My questions are,
Where does it store task histories?
How can we configure the task history cleanup policy?
Why do we lose complete task history on restarting Mesos?
To answer your questions:
Task history/state for Mesos is stored in memory and in the replicated_log (details here). The default is to use the replicated_log; to store state completely in memory, without the replicated_log, you would have to specify this in your Mesos flags, seen here in the configuration page, as --registry=in_memory.
Most users configure task history cleanup using these three flags (there are more, but these are the most common): --max_completed_frameworks=VALUE, --max_completed_tasks_per_framework=VALUE, and --max_unreachable_tasks_per_framework=VALUE, as described in the previously mentioned document.
Yes, task history for the /tasks endpoint is lost every time a Mesos Master is restarted. However, the /state endpoint will still contain all task status changes over time.
(Edited to reflect information about the /tasks endpoint, not the /state endpoint.)
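For reference, a minimal sketch of starting a Mesos master with the registry and retention flags mentioned above; the ZooKeeper addresses, work directory, and numeric values are placeholders rather than recommendations:
# Keep master state in the replicated log (the default) and bound how much
# completed-framework/task history the master holds in memory.
mesos-master \
  --zk=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
  --quorum=2 \
  --work_dir=/var/lib/mesos \
  --registry=replicated_log \
  --max_completed_frameworks=50 \
  --max_completed_tasks_per_framework=1000 \
  --max_unreachable_tasks_per_framework=1000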

SparkException: Master removed our application

I know there are other very similar questions on Stack Overflow, but those either didn't get answered or didn't help me out. In contrast to those questions, I put much more stack trace and log file information into this one. I hope that helps, although it made the question sorta long and ugly. I'm sorry.
Setup
I'm running a 9-node cluster on Amazon EC2 using m3.xlarge instances with DSE (DataStax Enterprise) version 4.6 installed. For each workload (Cassandra, Search, and Analytics), 3 nodes are used. DSE 4.6 bundles Spark 1.1 and Cassandra 2.0.
Issue
The application (Spark/Shark shell) gets removed after ~3 minutes even if I do not run any query. Queries on small datasets run successfully as long as they finish within ~3 minutes.
I would like to analyze much larger datasets. Therefore I need the application (shell) not to get removed after ~3 minutes.
Error description
In the Spark or Shark shell, after idling for ~3 minutes or while executing (long-running) queries, Spark eventually aborts and gives the following stack trace:
15/08/25 14:58:09 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
This is not very helpful (to me), so I'm going to show you more log file information.
Error Details / Log Files
Master
From the master.log, I think the interesting parts are
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.46.48:46715 got disassociated, removing it.
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker#172.31.33.35:42136 got disassociated, removing it.
and
ERROR 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Application Shark::ip-172-31-46-49 with ID app-20150825091745-0007 failed 10 times, removing it
INFO 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Removing app app-20150825091745-0007
Why do the worker nodes get disassociated?
In case you need to see it, I attached the master's executor (ID 1) stdout as well. The executor's stderr is empty. However, I think it shows nothing useful for tackling the issue.
On the Spark Master UI I verified that all worker nodes are ALIVE. The second screenshot shows the application details.
There is one executor spawned on the master instance, while the executors on the two worker nodes get respawned until the whole application is removed. Is that okay or does it indicate some issue? I think it might be related to the "failed 10 times" error message from above.
Worker logs
Furthermore, I can show you the logs of the two Spark worker nodes. I removed most of the classpath arguments to shorten the logs. Let me know if you need to see them. As each worker node spawns multiple executors, I attached links to some (not all) of the executor stdout and stderr dumps. The dumps of the remaining executors look basically the same.
Worker I
worker.log
Executor (ID 10) stdout
Executor (ID 10) stderr
Worker II
worker.log
Executor (ID 3) stdout
Executor (ID 3) stderr
The executor dumps seem to indicate some issue with permissions and/or timeouts, but from the dumps I can't figure out any details.
Attempts
As mentioned above, there are some similar questions, but none of them were answered or helped me solve the issue. Anyway, things I tried and verified are:
Opened port 2552. Nothing changed.
Increased spark.akka.askTimeout, which made the Spark/Shark app live longer, but eventually it still got removed.
Ran the Spark shell locally with spark.master=local[4]. On the one hand, this allowed me to run queries longer than ~3 minutes successfully; on the other hand, it obviously doesn't take advantage of the distributed environment.
Summary
To sum up, one could say that the timeouts and the fact that long-running queries execute successfully in local mode both point to some misconfiguration, though I cannot be sure and I don't know how to fix it.
Any help would be very much appreciated.
Edit: Two of the Analytics and two of the Solr nodes were added after the initial setup of the cluster. Just in case that matters.
Edit (2): I was able to work around the issue described above by replacing the Analytics nodes with three freshly installed Analytics nodes. I can now run queries on much larger datasets without the shell being removed. I don't intend to post this as an answer to the question, as it is still unclear what is wrong with the three original Analytics nodes. However, as it is a cluster for testing purposes, it was okay to simply replace the nodes (after replacing them I performed a nodetool rebuild -- Cassandra on each of the new nodes to recover their data from the Cassandra datacenter).
As mentioned in the attempts, the root cause is a timeout between the master node and one or more workers.
Another thing to try: verify that all workers are reachable by hostname from the master, either via DNS or an entry in the /etc/hosts file.
In my case, the problem was that the cluster was running in an AWS subnet without DNS. The cluster grew over time by spinning up a node, then adding the node to the cluster. When the master was built, only a subset of the addresses in the cluster was known, and only that subset was added to the /etc/hosts file.
When dse spark was run from a "new" node, communication from the master to that worker by hostname failed and the master killed the job.
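For illustration, a minimal sketch of the kind of /etc/hosts entries the master needs when there is no DNS; the hostnames below are hypothetical, and the addresses are simply the worker IPs that appear in the master.log excerpt above:
# /etc/hosts on the Spark master: every worker must be resolvable by hostname
172.31.46.48   worker-1
172.31.33.35   worker-2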
