Mesos agent cannot access sandbox from UI

I cannot access the sandbox logs, nor can I see any tasks, on a new agent that I've added to the cluster.
This is the error:
Mesos UI error
The new agent is detected and bootstraps fine, with the IP and the hostname correctly configured:
"/usr/sbin/mesos-slave --hostname=mss4 --ip=10.32.8.160 --no-systemd_enable_support --work_dir=/tmp/mesos"
Any idea?
thanks

After hours of research... my hosts file had no entry for that machine :(
Resolved.
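For anyone hitting the same symptom, a minimal sketch of how the missing hosts entry could be confirmed: it checks whether a hostname resolves on the machine where it is run. The hostname mss4 comes from the question; the function name resolve_or_none is hypothetical and only for illustration.

```python
import socket


def resolve_or_none(hostname, resolver=socket.gethostbyname):
    """Return the IPv4 address for hostname, or None if it does not resolve.

    The resolver parameter exists only so the lookup can be swapped out;
    by default the system resolver (hosts file, then DNS) is used.
    """
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
```

Calling resolve_or_none("mss4") on the machine that shows the error returns None when neither the hosts file nor DNS knows the agent's hostname.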

Related

Error syncing pod on starting Beam - Dataflow pipeline from docker

We consistently get an error when starting our Beam Go SDK pipeline (driver program) from a Docker image; the same program works when started locally or from a VM instance. We use the Dataflow runner for the pipeline and Kubernetes to deploy.
LOCAL SETUP:
We have the GOOGLE_APPLICATION_CREDENTIALS variable set with the service account for our GCP cluster. When we run the job locally, it is submitted to Dataflow and completes successfully.
DOCKER SETUP:
The build image is FROM golang:1.14-alpine. When we package the same program with a Dockerfile and try to run it, it fails with the error
User program exited: fork/exec /bin/worker: no such file or directory
On checking Stackdriver logs for more details, we see this:
Error syncing pod 00014c7112b5049966a4242e323b7850 ("dataflow-go-job-1-1611314272307727-01220317-27at-harness-jv3l_default(00014c7112b5049966a4242e323b7850)"), skipping: failed to "StartContainer" for "sdk" with CrashLoopBackOff: "back-off 2m40s restarting failed container=sdk pod=dataflow-go-job-1-1611314272307727-01220317-27at-harness-jv3l_default(00014c7112b5049966a4242e323b7850)"
We found a reference to this error in the Dataflow common errors doc, but it is too generic to figure out what is failing. After multiple retries, we were able to rule out any permission or access issues from the pods. We are not sure what else could be the problem here.
After multiple attempts, we decided to start the job manually from a new Debian 10 based VM instance, and it worked. This made us notice that we were using an Alpine-based golang image in Docker, which may not have all the dependencies required to start the job.
On the golang Docker Hub page we found golang:1.14-buster, where buster is the codename for Debian 10. Using that image for the Docker build solved the issue. Self-answering here to help anyone else facing the same issue.

SSH timeout error on Azure DevOps CD pipeline

I am getting a timeout error when trying to deploy to a VM instance hosted on AWS. Manually I can log in using
ssh -i myKeyFile.pem myuser@IP
Once I am on the remote machine I can execute some docker commands and everything works fine. But now that I need to automate this in the CD pipeline, I get the following error:
2020-06-02T21:37:12.6877276Z Trying to establish an SSH connection to ***@IP:port
2020-06-02T21:38:52.4629461Z ##[error]Failed to connect to remote machine. Verify the SSH service connection details. Error: Error: Timed out while waiting for handshake.
2020-06-02T21:38:52.4685976Z ##[section]Finishing: Run shell commands on remote machine
The steps I followed to set up the SSH connection are:
I created an SSH service connection in the project settings in Azure DevOps
I created the CD pipeline
I added an SSH task with the following parameters
When I trigger it manually to test whether it works, the release starts fine, but after about 1:43 minutes I get the error:
Then, when I review the logs, it is the same error I pasted at the beginning:
[error]Failed to connect to remote machine. Verify the SSH service connection details. Error: Error: Timed out while waiting for handshake
I've increased the handshake timeout setting from the default (20000) to 90000, but no luck.
Has anyone faced this problem before?
There seems to be an ongoing issue with the default agent pools in Azure DevOps. Many people have reported it, and the Azure DevOps team is working on it at the time of writing (I couldn't find the post where all of this is detailed; I will add it later).
The workaround is:
Create a self-hosted agent.
Once it has been created, re-create your CD pipeline using the new self-hosted agent.
The rest of the SSH task configuration depends on your needs. But if you want to test that the SSH connection works, just print something:
echo "I'm connected"
After this your CD pipeline should work fine.
See the documentation on how to create a self-hosted agent on Windows. There are also links for Linux and Mac.
I had a similar issue with a VM in Azure. It turned out I had set the security group to allow SSH in only from my local network, and Azure DevOps agents run in a Microsoft network and were coming from a different IP address range. The solution was to open up SSH to all source IP addresses. You can get the list of IP address ranges that Azure DevOps agents use, but they appear to change every week, which isn't very helpful.
See https://learn.microsoft.com/en-us/azure/devops/organizations/security/allow-list-ip-url?view=azure-devops#microsoft-hosted-agents
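To see whether a connecting address is covered by a given set of ranges from that list, something like the following sketch may help. It uses Python's standard ipaddress module; the function name is_allowed is hypothetical, and the ranges shown are reserved documentation ranges used as placeholders, not real Azure DevOps ranges.

```python
import ipaddress


def is_allowed(source_ip, allowed_cidrs):
    """Return True if source_ip falls inside any of the allowed CIDR ranges."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Placeholder ranges (reserved for documentation), NOT the real agent ranges.
ALLOWED = ["192.0.2.0/24", "198.51.100.0/24"]

print(is_allowed("192.0.2.17", ALLOWED))   # True: inside 192.0.2.0/24
print(is_allowed("203.0.113.5", ALLOWED))  # False: outside both ranges
```

Since the published ranges change over time, the list passed in would have to be refreshed regularly for the check to stay meaningful.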

How to restart TeamCity server

I am a beginner with TeamCity. Our TeamCity 9 server stopped working after I installed Gradle. I suspect it was a problem with a port or something like that. I removed Gradle, but TeamCity still didn't work, so I tried to restart the TeamCity server. We have two TeamCity agents. I stopped the agents with:
sudo ./runAll.sh stop
and I stopped the server with sudo ./shutdown.sh
After that I started the server again with ./startup.sh and the agents with
sudo ./runAll.sh start
Now when I enter the URL in the browser I get either connection_timeout or connection refused. But when I enter the URL with the explicit IP address, like 10.31.24.18:8111, then I get
My questions:
1- How can I restart TeamCity and the agents so that I get the same agents and projects in the TeamCity UI as before the restart? Or, if I create an Administrator account now, will I have to reconfigure all projects, or will my projects from before the restart still be there?
2- Why does the URL with the IP address work while the URL with the domain name (server name) does not?
You can restart TeamCity right from the UI: Administration > Diagnostics > Server Restart. You will need to have server admin permissions for that.
Or, using the command line:
cd /opt/teamcity/bin
(sudo) ./teamcity-server.sh stop
(sudo) ./teamcity-server.sh start

Hadoop 2.6.4 Web UI Time Out

I installed Hadoop 2.6.4 on 4 AWS instances: 1 namenode, 1 secondary namenode, and 2 slaves. After the installation completed, I tried to view the namenode web UI at ec2-52-90-242-76.compute-1.amazonaws.com:50070, but the request times out. Can anybody help?
If you are accessing it from your own system, you need to add the IP address and hostname to your hosts file, or you can open it directly with IP_address:50070
Also check the following:
Check whether the firewall is on or off (off is recommended)
Check the iptables service status (stopped is recommended)
Check SELinux (disabled is recommended)
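Before changing any of these settings, it may help to confirm from the client side whether the port is reachable at all. A minimal sketch using Python's standard socket module; the function name port_is_open and the 3 second default timeout are assumptions for illustration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False
```

For example, port_is_open("IP_address", 50070) run from your own machine tells you whether anything answers on the namenode web UI port. If it returns False there but True when run on the instance itself, something between the two machines (a firewall rule, for instance) is blocking the connection.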

Cannot install Jenkins slave on a Windows machine

I am very new to Jenkins.
I am trying to launch a slave via Java Web Start. The master address is http://masterdomain.com, but when I try to install the Jenkins slave I see that it tries to connect to http://masterdomain.com:54999. Any ideas why this happens?
Jenkins uses a special port for connections between the slave and the master. By default the port number is variable, but if you have a firewall between the master and the slave you can fix the port number via the URL <your Jenkins master URL>/configureSecurity/
To answer your question from the comments above: the master URL for Jenkins is configured under Manage Jenkins --> Configure System --> Jenkins Location
Java Web Start reads the URL from this setting and uses it to set up the JNLP agent on the slave.
Hope this helps.
