We are constantly getting an error when starting our Beam Golang SDK pipeline (driver program) from a Docker image, although the same program works when started from a local machine / VM instance. We are using the Dataflow runner for our pipeline and Kubernetes to deploy.
LOCAL SETUP:
We have the GOOGLE_APPLICATION_CREDENTIALS variable set to the service account for our GCP project. When running the job locally, it gets submitted to Dataflow and completes successfully.
DOCKER SETUP:
The build image used is FROM golang:1.14-alpine. When we package the same program with a Dockerfile and try to run it, it fails with this error:
User program exited: fork/exec /bin/worker: no such file or directory
On checking Stackdriver logs for more details, we see this:
Error syncing pod 00014c7112b5049966a4242e323b7850 ("dataflow-go-job-1-1611314272307727-
01220317-27at-harness-jv3l_default(00014c7112b5049966a4242e323b7850)"),
skipping: failed to "StartContainer" for "sdk" with CrashLoopBackOff:
"back-off 2m40s restarting failed container=sdk pod=dataflow-go-job-1-
1611314272307727-01220317-27at-harness-jv3l_default(00014c7112b5049966a4242e323b7850)"
We found a reference to this error in the Dataflow common errors doc, but it is too generic to figure out what's failing. After multiple retries, we were able to rule out any permission / access related issues with the pods. Not sure what else could be the problem here.
After multiple attempts, we decided to start the job manually from a new Debian 10 based VM instance, and it worked. This brought to our notice that we were using an Alpine-based golang image in Docker, which may not have all the dependencies required to start the job.
On the golang Docker Hub page, we found golang:1.14-buster, where buster is the codename for Debian 10. Using that image for the docker build solved the issue, most likely because a binary built on the Alpine image is linked against musl libc and fails to exec on a Debian-based worker with exactly this "no such file or directory" error. Self-answering here to help anyone else facing the same issue.
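For reference, a minimal sketch of the kind of Dockerfile that worked (the binary name and paths here are illustrative, not our exact setup):

# Debian 10 (buster) based Go image instead of alpine
FROM golang:1.14-buster
WORKDIR /app
# copy the pipeline source (including go.mod) and build the driver binary
COPY . .
RUN go build -o /app/driver .
# the driver submits the job to Dataflow when the container starts
ENTRYPOINT ["/app/driver"]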
I'm following the deployment guide for DataHub with Kubernetes from the documentation: https://datahubproject.io/docs/deploy/kubernetes
Setting up the local cluster with Minikube, I started following the prerequisites section of the guide.
At first I tried to change some of the default values to try it locally (I had already installed it successfully on Google Kubernetes Engine, so I was trying different setups).
But on the first step of the installation I received this error:
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: resource mapping not found for name: "elasticsearch-master-pdb" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1"
ensure CRDs are installed first
The steps I followed after installing Minikube were the exact steps presented on the page:
helm repo add datahub https://helm.datahubproject.io/
helm install prerequisites datahub/datahub-prerequisites
The error happens on step 2.
At first I switched back to the default configuration to check whether the problem was a mistake in my new values, but the error remained.
I expected that after following the exact default steps the installation would succeed locally, just like it did on GKE.
I got help browsing the DataHub Slack community and figured out a way to fix this error.
It was simply a Kubernetes version problem: the prerequisites chart still declares the PodDisruptionBudget with the policy/v1beta1 API, which newer Kubernetes versions no longer serve. I was able to fix it by forcing Minikube to start with Kubernetes 1.19.0:
minikube start --kubernetes-version=v1.19.0
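For completeness, the full sequence that worked (the minikube delete step is only needed if a cluster was already created with a newer Kubernetes version):

minikube delete
minikube start --kubernetes-version=v1.19.0
helm repo add datahub https://helm.datahubproject.io/
helm install prerequisites datahub/datahub-prerequisites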
I am deploying my Spring Boot application as a compiled jar file running in a Docker container on GCP, and my pipeline deploys it through the gcloud CLI:
gcloud beta run deploy $SERVICE_NAME --image $IMAGE_NAME --region europe-north1 --project
This works and gives me the correct response when the application starts successfully. However, when there's an error and the application fails to start, I get:
Cloud Run error: The user-provided container failed to start and listen on the port defined provided by the PORT=8080 environment variable.
The next time the pipeline runs (with the errors fixed), the gcloud beta run deploy command fails with the same error as above, even though the actual application runs without issues in Cloud Run. How can I solve this?
Currently I have to check Cloud Run manually because I cannot trust my pipeline, and I have to run it twice to make it succeed. Any help would be appreciated. Let me know if you need any extra information.
I am on Windows 10 with docker-compose and want to work with the Apache Flink tutorial playground. docker-compose starts correctly, but after several minutes of running, Apache Flink has to create checkpoints and there is some problem with access to the file system.
Exception:
org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 104. Failure reason: Failure to finalize checkpoint.
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1216) ~[flink-dist_2.11-1.12.1.jar:1.12.1]
…..
Caused by: org.apache.flink.util.SerializedThrowable: Mkdirs failed to create file:/tmp/flink-checkpoints-directory/d73c2f87b0d7ea6748a1913ee4b50afe/chk-104
at org.apache.flink.core.fs.local.LocalFileSystem.create(LocalFileSystem.java:262) ~[flink-dist_2.11-1.12.1.jar:1.12.1]
Could you please help me with the correct path and Docker access? The relevant Flink configuration is:
state.backend: filesystem
state.checkpoints.dir: file:///tmp/flink-checkpoints-directory
state.savepoints.dir: file:///tmp/flink-savepoints-directory
I also tried using a full Windows path, but got the same error.
Are you using Windows Docker containers or Linux Docker containers?
Right-click on the Docker Desktop icon to see your current configuration.
Switch to Windows containers
OR
Switch to Linux containers
You have to configure the Flink paths according to your target Docker container type.
Note: You cannot use Windows and Linux containers at the same time.
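If you end up on Linux containers, one approach (just a sketch; the service names are assumed to match the playground's docker-compose.yml) is to mount the same volume into both the jobmanager and taskmanager at the checkpoint path from your configuration, so the directory exists and is shared:

services:
  jobmanager:
    volumes:
      - flink-checkpoints:/tmp/flink-checkpoints-directory
  taskmanager:
    volumes:
      - flink-checkpoints:/tmp/flink-checkpoints-directory
volumes:
  flink-checkpoints: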
I am trying the Docker Get Started tutorial, Part 3 (Services). In the part where I need to init a swarm and deploy a stack, all of my services end up with status rejected.
The full error (using --no-trunc) is:
hnsCall failed in Win32: The parameter is incorrect. (0x57)
Here are the steps I am doing:
Ensure my image is correct (docker run works well; I accessed localhost:4000 successfully). Then I stopped the container to make sure it does not interfere.
When I init the swarm, it says I have multiple addresses, so I picked one of them (I tried with either of them, same result) using --advertise-addr.
docker stack deploy works, but when I check the status with docker service ps, none of them are up. localhost:4000 has no listener.
Note: I switched Docker to Windows containers.
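For reference, the commands from those steps look roughly like this (the IP is a placeholder, and getstartedlab / web are the stack and service names from the tutorial):

docker swarm init --advertise-addr <one-of-my-IPs>
docker stack deploy -c docker-compose.yml getstartedlab
docker service ps getstartedlab_web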
I am new to Docker and this is beyond me. Can anyone please suggest a solution or a way to debug this?
I tried everything but could not get it to run with Windows containers, so I switched to Linux containers. The Get Started Part 3 then runs well.
I deployed fabric8 on Google Container Engine with 12 cores and 45 GB RAM, using gofabric8 0.4.69 for the deployment.
I tried to create a microservice, but it fails in the integration testing phase with the following error: "Waiting for container:spring-boot. Reason:CrashLoopBackOff"
Please help me resolve this.
Which quickstart were you trying?
It sounds like the application terminated. I wonder if this shows any output:
kubectl get pod
kubectl logs nameofpod
where nameofpod is the pod that is crashing.
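If the logs come back empty, two other standard commands usually show why a container is crash-looping (nameofpod again being the crashing pod):

kubectl describe pod nameofpod
kubectl logs nameofpod --previous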
BTW the new fabric8-maven-plugin version (3.1.45 or later) now has a nicer fabric8:run goal.
If you clone the git repository to your local file system and update the version of fabric8-maven-plugin, you should be able to run it via:
mvn fabric8:run
Then you get to see the output of the Spring app in your console and can check whether something fails.
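For the version bump, the plugin entry in the quickstart's pom.xml would look roughly like this (a sketch; the exact location depends on the quickstart):

<plugin>
  <groupId>io.fabric8</groupId>
  <artifactId>fabric8-maven-plugin</artifactId>
  <version>3.1.45</version>
</plugin>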