Mesos framework stays inactive due to "Authentication failed: EOF" - go

I'm currently trying to deploy Eremetic (version 0.28.0) on top of Marathon using the configuration provided as an example. I was actually able to deploy it once, but after trying to redeploy it, the framework now stays inactive.
Inspecting the logs, I noticed repeated attempts to connect to some service that apparently never succeed because of an authentication problem:
2017/08/14 12:30:45 Connected to [REDACTED_MESOS_MASTER_ADDRESS]
2017/08/14 12:30:45 Authentication failed: EOF
The service returning the error appears to be ZooKeeper, and more precisely the error can be traced back to this line in the Go ZooKeeper library. ZooKeeper itself, however, seems to be working: I've queried it directly with zkCli and run a small Spark job (where the Mesos master is given as a zk:// URL), and everything works.
Unfortunately I'm not able to diagnose the problem further. What could it be?

It turned out to be a configuration problem: the master URL was simply wrong, and this is how the error was reported.
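For anyone hitting the same error, here is a minimal sketch (not from the original answer) of how to sanity-check the ZooKeeper ensemble named in the zk:// master URL, assuming the github.com/samuel/go-zookeeper/zk package that the error originates from; the hosts and the /mesos znode path are placeholders to adjust to your setup.

// Quick reachability check against the ZooKeeper ensemble from the zk:// master URL,
// using the same client library the "Authentication failed: EOF" error comes from.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func main() {
	// Placeholder hosts; take them from your zk://host1:2181,.../mesos master URL.
	hosts := []string{"zk1.example.com:2181"}

	conn, _, err := zk.Connect(hosts, 5*time.Second)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close()

	// /mesos is the default znode where Mesos masters register their leader info;
	// a wrong master URL typically surfaces here as a connection or session error.
	children, _, err := conn.Children("/mesos")
	if err != nil {
		log.Fatalf("children: %v", err)
	}
	fmt.Println("znodes under /mesos:", children)
}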

How to troubleshoot DDEV DB container healthcheck timeout

When I want to start my DDEV project, a container gets stuck at creating:
Container ddev-oszimt-lf12a-v2-db Started
Error Message:
Failed waiting for web/db containers to become ready: db container failed: log=, err=health check timed out after 2m0s: labels map[com.ddev.site-name:oszimt-lf12a-v2 com.docker.compose.service:db] timed out without becoming healthy, status=
It's an error I have also had with some other projects.
The error log contains no information about this.
What could the problem be, and how do I fix it?
This isn't a very good place to study problems with specific projects; our Discord channel or the DDEV issue queue is much better.
But I'll try to give you some ideas about how to study and debug this.
Go to the Troubleshooting section of the docs. Work through it step-by-step.
As it says there, try the simplest possible project and see what the results are.
If the problem is specific to one particular project, see if you can remove customizations like .ddev/docker-compose.*.yaml files and config.*.yaml files, and non-standard settings in the config.yaml file.
To find out what causes the healthcheck timeout, see the docs on this exact problem; in your case the db container is timing out. So first, run ddev logs -s db to see if something happened, and second, run docker inspect --format "{{json .State.Health }}" ddev-<projectname>-db.
For more help, you'll need to provide more information, such as your OS and Docker provider. The easiest way to do that is to run ddev debug test, capture the output, put it in a gist on gist.github.com, and then come over to Discord with a link to it.
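If it helps, the same health information can also be read programmatically. The sketch below uses the Docker Go SDK and is just an equivalent of the docker inspect command above, with the container name from the error message as a placeholder.

// Print the db container's healthcheck state, equivalent to
// docker inspect --format "{{json .State.Health }}" <container>.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}

	// Container name taken from the error above; substitute your own project's db container.
	info, err := cli.ContainerInspect(context.Background(), "ddev-oszimt-lf12a-v2-db")
	if err != nil {
		log.Fatal(err)
	}

	if info.State == nil || info.State.Health == nil {
		fmt.Println("no healthcheck state recorded")
		return
	}
	out, _ := json.MarshalIndent(info.State.Health, "", "  ")
	fmt.Println(string(out))
}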

Can golang/build/cmd/coordinator be run locally?

I am using the instructions on https://pkg.go.dev/golang.org/x/build/cmd/coordinator#section-readme
to run a coordinator and buildlet locally. I have tried on both Darwin and Linux, but in both cases the coordinator displays an error message saying: bad build params
The summary of the buildlet error is: connection refused
I suspect the coordinator is the issue because the buildlet works fine locally when the coordinator is not needed.
https://localhost:8119 is not working either.
It seems to me this could be an issue that needs to be filed. Are these instructions currently working for anyone?
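As a small diagnostic (not part of the coordinator itself), you can check from Go whether anything is listening on the coordinator's local port and whether a TLS handshake succeeds; port 8119 is taken from the question, and the self-signed-certificate assumption is mine.

// Check whether the local coordinator port accepts TCP connections and completes a TLS handshake.
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	addr := "localhost:8119"

	// A plain TCP dial first: "connection refused" here means nothing is listening at all.
	c, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		fmt.Println("tcp dial failed:", err)
		return
	}
	c.Close()

	// Then a TLS handshake with verification disabled, assuming the dev
	// coordinator serves a self-signed certificate.
	tc, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		fmt.Println("tls handshake failed:", err)
		return
	}
	defer tc.Close()
	fmt.Println("TLS handshake OK; peer certificates:", len(tc.ConnectionState().PeerCertificates))
}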

Google Cloud Data flow jobs failing with error 'Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5...'

SDK: Apache Beam SDK for Go 0.5.0
We are running Apache Beam Go SDK jobs in Google Cloud Dataflow. They had been working fine until recently, when they intermittently stopped working (no changes were made to code or config). The error that occurs is:
Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5 for /var/opt/google/staged/worker: ..., want ; bad MD5 for /var/opt/google/staged/worker: ..., want ;
(Note: it seems as if the second hash value is missing from the error message.)
As best I can guess, there's something wrong with the worker: it seems to be comparing MD5 hashes of the worker and missing one of the values, though I don't know exactly what it's comparing against.
Does anybody know what could be causing this issue?
The fix to this issue seems to have been rebuilding the worker_harness_container_image with the latest changes. I had tried this, but I didn't have the latest release when I built it locally. After I pulled the latest from the Beam repo, rebuilt the image (as per the notes at https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md), and reran the job, it seemed to work again.
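For context, this is roughly how a rebuilt image gets pointed at when launching a Beam Go pipeline on Dataflow. The import paths and the worker_harness_container_image flag reflect the Go SDK of that era, and the project, bucket, and image names are placeholders, so treat it as a sketch rather than the exact job in question.

// Minimal Beam Go pipeline; the invocation in the comment at the bottom shows where
// the rebuilt worker harness container image is passed to the Dataflow runner.
package main

import (
	"context"
	"flag"
	"log"

	"github.com/apache/beam/sdks/go/pkg/beam"
	"github.com/apache/beam/sdks/go/pkg/beam/x/beamx"
	"github.com/apache/beam/sdks/go/pkg/beam/x/debug"
)

func main() {
	flag.Parse()
	beam.Init()

	p := beam.NewPipeline()
	s := p.Root()
	debug.Print(s, beam.Create(s, "smoke test"))

	if err := beamx.Run(context.Background(), p); err != nil {
		log.Fatalf("job failed: %v", err)
	}
}

// Example invocation (placeholder project, bucket, and image):
//   go run main.go --runner=dataflow --project=my-project \
//     --staging_location=gs://my-bucket/staging \
//     --worker_harness_container_image=gcr.io/my-project/beam-go:rebuilt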
I'm seeing the same thing. If I look into the Stackdriver logging I see this:
Handler for GET /v1.27/images/apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515/json returned error: No such image: apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515
However, I can pull the image just fine locally. Any ideas why Dataflow cannot pull it?

activator ui show Waiting for sbt server to restart and Lost or failed sbt connection: Operation timed out

The Activator UI view always shows "Waiting for sbt server to restart", and the Build view always outputs "Lost or failed sbt connection: Operation timed out".
The following is only the beginning of an answer. I've been researching a nearly identical problem, and it sounds like Activator uses an sbt plugin called sbt-remote-control to restart the sbt server. I have not yet located where this plugin is called from Activator, but I suspect something in this plugin is timing out, either due to a default or to an argument passed from Activator.
The Activator FAQ says you can add a line to ~/.activator/activatorconfig.txt to lengthen a timeout, but I followed the suggestion and it did not resolve the problem for me. (I also found I had to create the activatorconfig.txt file for my installation.)
Security configuration may also be a factor. My version of this issue occurs on a corporate Win7 laptop with high security lockdown, but not on another personal Win7 laptop.

Ruby Stack failed to deploy on Google Developers Console

I tried to deploy the Ruby stack using the Google Developers Console, but with no success. I tried several times in other projects as well, and the error was always the same (below).
Do you have any idea why it keeps failing?
2014/10/23 15:59:44
rubyStackBox: PENDING
2014/10/23 15:59:55~2014/10/23 16:06:01
rubyStackBox: DEPLOYING
2014/10/23 16:06:11
rubyStackBox: DEPLOYMENT_FAILED
Replica rubystackbox-eaeo failed with status PERMANENTLY_FAILING: Replica State changed to PERMANENTLY_FAILING. Replica was unhealthy 2 consecutive times.
I replicated the issue you experienced several times and it also failed. What finally worked was playing with the zones/regions when deploying the Ruby stack:
Developers console > Click-to-deploy > Set MySQL password > Advanced Options, choose a different zone and click Deploy.
Another useful tool when investigating this is Console Output. Even if the deployment fails, you can go to the VM instance and check View Output towards the bottom of the page. It will list all the packages and any errors encountered. The following command will achieve the same thing:
$ gcloud compute instances get-serial-port-output <INSTANCE_NAME> --project <PROJECT_ID> --zone <ZONE_NAME>
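The same output can also be fetched programmatically through the Compute Engine API. The sketch below uses the Go client with Application Default Credentials; the project, zone, and instance names are placeholders to replace with your own.

// Fetch the serial port (console) output of an instance, equivalent to the
// gcloud compute instances get-serial-port-output command above.
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx) // uses Application Default Credentials
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder project, zone, and instance name; replace with your own.
	out, err := svc.Instances.GetSerialPortOutput("my-project", "us-central1-a", "rubystackbox-eaeo").Do()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out.Contents)
}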
Please advise if you are still seeing issues.
