Microsoft Orleans in Kubernetes StatefulSet: POD crashes after several restarts - Consul

Microsoft Orleans v3.4.3
Consul Clustering
Running in K8S
siloBuilder
    .UseConsulClustering(opt =>
    {
        opt.Address = new Uri(AppConfig.Orleans.ConsulUrl);
        opt.AclClientToken = AppConfig.Orleans.AclClientToken;
    })
    .Configure<ClusterOptions>(options =>
    {
        options.ClusterId = AppConfig.Orleans.ClusterID;
        options.ServiceId = AppConfig.Orleans.ServiceID;
    })
    .UseKubernetesHosting();
I configured the labels and environment variables for my POD according to the docs.
- name: ORLEANS_SERVICE_ID # Required by Orleans
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['orleans/serviceId']
- name: ORLEANS_CLUSTER_ID # Required by Orleans
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['orleans/clusterId']
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['statefulset.kubernetes.io/pod-name']
- name: POD_NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
- name: POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
It is a StatefulSet with only 1 POD for testing.
On initial startup, it works well.
However, every time I restart the POD, a new entry is created in Consul, and it crashes on the subsequent startup.
The log says
System.AggregateException: One or more errors occurred. (Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184])
---> Orleans.Runtime.MembershipService.OrleansClusterConnectivityCheckFailedException: Failed to get ping responses from 1 of 1 active silos. Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster. Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184]
at Orleans.Runtime.MembershipService.MembershipAgent.ValidateInitialConnectivity()
at Orleans.Runtime.MembershipService.MembershipAgent.BecomeActive()
at Orleans.Runtime.MembershipService.MembershipAgent.<>c__DisplayClass26_0.<<Orleans-ILifecycleParticipant<Orleans-Runtime-ISiloLifecycle>-Participate>g__OnBecomeActiveStart|6>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
at Orleans.Hosting.SiloHost.StartAsync(CancellationToken cancellationToken)
at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)
at UBS.OrleansServer.EntryPoint.Start() in /app/UBS/OrleansServer/EntryPoint.cs:line 102
--- End of inner exception stack trace ---
I have to remove all the entries in Consul and restart the POD; then everything works fine.
The POD_NAME stays the same for a StatefulSet's POD, so is it correct that each POD restart creates a new entry in Consul?
What could be the cause?
Thanks in advance
UPDATE
After several rounds of crashes and restarts, it finally does not crash any more, and in the log I see the following message:
ProcessTableUpdate (called from DeclareDead) membership table: 5 silos, 1 are Active, 4 are Dead, Version=<31, 28123>. All silos: [SiloAddress=S10.18.123.244:11111:361163684 SiloName=ubs-job-dev-0 Status=Active, SiloAddress=S10.18.123.200:11111:361158057 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.210:11111:361161905 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.217:11111:361157424 SiloName=ubs-job-dev-0 Status=Dead, SiloAddress=S10.18.123.244:11111:361163558 SiloName=ubs-job-dev-0 Status=Dead]
The SiloName never changes and there is only one POD in the StatefulSet, but it sees 5 silos, 4 of them dead. It seems that each new POD, even if the pod name does not change, is seen as a new silo. Is that expected?

(Failed to get ping responses from 1 of 1 active silos.
Newly joining silos validate connectivity with all active silos that have recently updated their 'I Am Alive' value before joining the cluster.
Successfully contacted: []. Failed to get response from: [S10.18.123.218:11111:361110184])
It looks like your membership table (in Consul) thinks that you already have active silos in it. When your 'new' silo comes up and looks in the membership table, it sees those silos listed as active at the IP addresses recorded in the table.
To keep the cluster correct, a newly joining silo must be able to communicate with the existing silos. However, if the membership table is stale (IP addresses with status 3/active that are no longer reachable), the new silo tries to ping those 'active' silos, cannot reach them, and fails fast instead of joining.
You have a couple of solutions:
- clear the Consul table when deploying your solution
- change the deployment id (ClusterId) on every deployment (sketched below)
You obviously found the first solution (clear the table).
see silo lifecycle
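For the second option, note that ORLEANS_CLUSTER_ID is already read from the orleans/clusterId pod label, so one way to get a fresh cluster id on every rollout is to template a release identifier into that label. A minimal sketch, assuming the label values below are placeholders filled in by your CI/CD or Helm release:

# Hypothetical StatefulSet pod template labels; "rel-42" stands in for a
# value injected on each deployment so every rollout forms a new cluster.
spec:
  template:
    metadata:
      labels:
        orleans/serviceId: ubs-job-dev
        orleans/clusterId: ubs-job-dev-rel-42

Keep in mind that a new ClusterId means the new silos will not see the previous cluster's membership entries, so the stale Consul entries still have to be cleaned up at some point.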

Related

Kubernetes CronJob with spring boot task

I'm running a spring boot task inside a k8s pod. This is the k8s specification:
kind: CronJob
metadata:
  name: data-transmission
spec:
  schedule: "*/2 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: print-date
              image: fredde:latest
              imagePullPolicy: IfNotPresent
              livenessProbe:
                httpGet:
                  path: /actuator/health/liveness
                  port: 8080
                initialDelaySeconds: 2
                failureThreshold: 2
              readinessProbe: null
          restartPolicy: OnFailure
The pod starts as it should every 2 minutes. The task in the Spring Boot application runs and shuts itself down when it's done. But my issue is that the pod keeps running even after the Spring Boot application has exited; it just changes status to NotReady. I was expecting it to be Completed or Terminated.
A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully terminate. As pods successfully complete, the Job tracks the successful completions. When a specified number of successful completions is reached, the task (ie, Job) is complete. Deleting a Job will clean up the Pods it created. Suspending a Job will delete its active Pods until the Job is resumed again.
A simple case is to create one Job object in order to reliably run one Pod to completion. The Job object will start a new Pod if the first Pod fails or is deleted (for example due to a node hardware failure or a node reboot). (Source: Kubernetes)
In general, Jobs don't get deleted from the API as soon as they complete successfully; they are kept for some time so you can still see whether a Job succeeded or not. Kubernetes' TTL-after-finished controller takes care of this: you can set an expiration time on Jobs so they are deleted from the API, as shown below.
Note: This is taken from the official Kubernetes documentation; follow the links for the exact point of reference.
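A minimal sketch based on the question's CronJob, assuming a Kubernetes version where ttlSecondsAfterFinished is available (stable since 1.23); it goes on the Job spec, i.e. under jobTemplate.spec here:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-transmission
spec:
  schedule: "*/2 * * * *"
  jobTemplate:
    spec:
      # Delete the Job and its Pods 5 minutes after it finishes.
      ttlSecondsAfterFinished: 300
      template:
        spec:
          containers:
            - name: print-date
              image: fredde:latest
          restartPolicy: OnFailure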
I found a solution that works.
It seems that when my Spring application is done and shuts itself down, the k8s pod is still up and running because of Istio.
There is an open issue about it:
https://github.com/istio/istio/issues/11659
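A workaround often used for this situation (not part of the original post, so treat it as a sketch) is to have the Job container tell the Istio sidecar to exit once the task has finished, by POSTing to the pilot-agent's /quitquitquit endpoint on port 15020. The entrypoint below is an assumption for your image:

containers:
  - name: print-date
    image: fredde:latest
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Run the task, remember its exit code, then ask the Istio sidecar to
        # quit so the Job can complete. Assumes curl exists in the image.
        java -jar /app.jar
        EXIT_CODE=$?
        curl -fsS -X POST http://localhost:15020/quitquitquit || true
        exit $EXIT_CODE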

Azure Kubernetes: running a DaemonSet on a pool tainted "CriticalAddonsOnly=true:NoSchedule"

I'm configuring the Elastic Cloud agent on Azure AKS with a system pool and a user pool. On the system pool I configured the CriticalAddonsOnly=true:NoSchedule taint to prevent application pods from running there. I installed the Elastic Cloud agent, but I'm noticing that the DaemonSet tries to run pods on that system pool without success. I tried to set the CriticalAddonsOnly=true:NoSchedule toleration in the agent's yaml config, but I got the same errors. Is there a way to force the deployment onto the system pool, or to exclude the Elastic Cloud pods from that pool?
Here is how the yaml is set up:
tolerations:
  - key: node-role.kubernetes.io/control-plane
    effect: NoSchedule
  - key: node-role.kubernetes.io/master
    effect: NoSchedule
  - key: CriticalAddonsOnly
    operator: "Exists"
    effect: NoSchedule
Regards
node-role.kubernetes.io/control-plane and node-role.kubernetes.io/master are not taints on AKS nodes; they are node labels. So please remove them from the toleration spec.
Furthermore, specifying a toleration does not guarantee scheduling onto the tolerated nodes. It only marks that the node should not accept pods that do not tolerate its taints. As your 2nd node pool does not seem to be tainted, the scheduler simply places your pods there.
You could now add taints to your other node pools, or, more easily, just specify a node selector:
nodeSelector:
  kubernetes.azure.com/mode: system
tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Exists"
    effect: "NoSchedule"
The same could also be achieved with node affinity. Check the Helm chart or your deployment option to see whether nodeSelector or nodeAffinity is available.
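For completeness, the same constraint expressed with node affinity might look like this (a sketch, assuming the AKS system pool carries the kubernetes.azure.com/mode=system label used above):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Only schedule onto nodes labeled as part of the system pool.
            - key: kubernetes.azure.com/mode
              operator: In
              values: ["system"]
tolerations:
  - key: "CriticalAddonsOnly"
    operator: "Exists"
    effect: "NoSchedule"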

Is there any connection between the application startup of a Spring Boot app and the readiness probe of the K8s pod it is deployed on?

I want to mark the pod ready only when enough connections have been created and the pod is ready to handle requests. The connections are created at the startup of my Spring Boot application. How can I make sure that the pod is marked ready only after the connections are created?
You can write a small Python (or other) script which checks whether the connections are created and ready to receive requests.
Then add it to the pod deployment yaml as an initContainer:
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
initContainers:
  - name: my-connection-validator
    image: path-to-your-image
    env:
      - name: POD_HOST
        value: "localhost-or-ip"
      - name: POD_PORT
        value: "12345"

Kubernetes Endpoint created for Kafka but not reflecting in POD

In my Kubernetes cluster I have created an Endpoints object pointing to my Kafka cluster. The Endpoints object was created successfully.
Name - kafka
Endpoint - X.X.X.X:9092
In my Spring Boot application's deployment yaml I have an environment variable BROKER_IP, which I pointed at it:
env:
  - name: BROKER_IP
    value: kafka
The POD is in an Error state. In my bootstrap-server setting I am getting kafka and not the actual endpoint address that was created. Any thoughts?
UPDATE - Just tried kafka:9092 and it worked. So I'm wondering: does the Endpoints object map to the IP only and not the port? Is my understanding correct?
Is it possible that you forgot to create the Service object matching the Endpoints? Because you are providing the ip-port pairs yourself, the Service would need to be selectorless.
This works for me:
kind: Endpoints
apiVersion: v1
metadata:
  name: kafka
subsets:
  - addresses: [{ip: "1.2.3.4"}]
    ports: [{port: 9092}]
---
kind: Service
apiVersion: v1
metadata:
  name: kafka
spec:
  ports: [{port: 9092}]
Testing it:
$ kubectl run kafka-dns-test --image=busybox --attach --rm --restart=Never -- nslookup kafka
If you don't see a command prompt, try pressing enter.
Server: 10.96.0.10
Address: 10.96.0.10:53
Name: kafka.default.svc.cluster.local
Address: 10.96.220.40
Successful lookup, ignore extra *** Can't find xxx: No answer messages
Also, because there is a Service object you get some environment variables in your Pods (without having to declare them):
KAFKA_PORT='tcp://10.96.220.40:9092'
KAFKA_PORT_9092_TCP='tcp://10.96.220.40:9092'
KAFKA_PORT_9092_TCP_ADDR='10.96.220.40'
KAFKA_PORT_9092_TCP_PORT='9092'
KAFKA_PORT_9092_TCP_PROTO='tcp'
KAFKA_SERVICE_HOST='10.96.220.40'
KAFKA_SERVICE_PORT='9092'
But the most flexible way to use a Service is still to use the dns name (kafka in this case).
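Tying this back to the question's UPDATE: the Endpoints/Service pair only maps the name kafka to an address; the client still needs the port, which is why kafka:9092 worked. In the deployment that might look like the following sketch (BROKER_IP is the question's own variable; the value is just the service name plus port):

env:
  - name: BROKER_IP
    # DNS name of the selectorless Service plus the broker port.
    value: "kafka:9092"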

Can't access Google Cloud Datastore from Google Kubernetes Engine cluster

I have a simple application that Gets and Puts information from a Datastore.
It works everywhere, but when I run it from inside the Kubernetes Engine cluster, I get this output:
Error from Get()
rpc error: code = PermissionDenied desc = Request had insufficient authentication scopes.
Error from Put()
rpc error: code = PermissionDenied desc = Request had insufficient authentication scopes.
I'm using the cloud.google.com/go/datastore package and the Go language.
I don't know why I'm getting this error since the application works everywhere else just fine.
Update:
Looking for an answer I found this comment on Google Groups:
In order to use Cloud Datastore from GCE, the instance needs to be configured with a couple of extra scopes. These can't be added to existing GCE instances, but you can create a new one with the following Cloud SDK command:
gcloud compute instances create hello-datastore --project --zone --scopes datastore userinfo-email
Would that mean I can't use Datastore from GKE by default?
Update 2:
I can see that when creating my cluster I didn't enable any permissions (most services are disabled by default). I suppose that's what's causing the issue.
Strangely, I can use Cloud SQL just fine even though it's disabled (using the cloudsql_proxy container).
So what I learnt in the process of debugging this issue was that:
- During the creation of a Kubernetes cluster you can specify permissions (access scopes) for the GCE nodes that will be created.
- If you, for example, enable Datastore access on the cluster nodes during creation, you will be able to access Datastore directly from the Pods without having to set up anything else.
- If your cluster node permissions are disabled for most things (the default settings), like mine were, you will need to create an appropriate Service Account for each application that wants to use a GCP resource like Datastore.
- Another alternative is to create a new node pool with the gcloud command, set the desired permission scopes, and then migrate all deployments to the new node pool (rather tedious).
So at the end of the day I fixed the issue by creating a Service Account for my application, downloading the JSON authentication key, creating a Kubernetes secret which contains that key, and, in the case of Datastore, setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of the mounted secret JSON key.
This way when my application starts, it checks if the GOOGLE_APPLICATION_CREDENTIALS variable is present, and authenticates Datastore API access based on the JSON key that the variable points to.
Deployment YAML snippet:
...
containers:
  - image: foo
    name: foo
    env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /auth/credentials.json
    volumeMounts:
      - name: foo-service-account
        mountPath: "/auth"
        readOnly: true
volumes:
  - name: foo-service-account
    secret:
      secretName: foo-service-account
After struggling for some hours, I was also able to connect to Datastore. Here are my results, most of them from the Google docs:
Create Service Account
gcloud iam service-accounts create [SERVICE_ACCOUNT_NAME]
Get full iam account name
gcloud iam service-accounts list
The result will look something like this:
[SERVICE_ACCOUNT_NAME]@[PROJECT_NAME].iam.gserviceaccount.com
Give owner access to the project for the service account
gcloud projects add-iam-policy-binding [PROJECT_NAME] --member serviceAccount:[SERVICE_ACCOUNT_NAME]@[PROJECT_NAME].iam.gserviceaccount.com --role roles/owner
Create key-file
gcloud iam service-accounts keys create mycredentials.json --iam-account [SERVICE_ACCOUNT_NAME]@[PROJECT_NAME].iam.gserviceaccount.com
Create app-key Secret
kubectl create secret generic app-key --from-file=credentials.json=mycredentials.json
This app-key secret will then be mounted in the deployment.yaml
Edit deployment file
deployment.yaml:
...
spec:
  containers:
    - name: app
      image: eu.gcr.io/google_project_id/springapplication:v1
      volumeMounts:
        - name: google-cloud-key
          mountPath: /var/secrets/google
      env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /var/secrets/google/credentials.json
      ports:
        - name: http-server
          containerPort: 8080
  volumes:
    - name: google-cloud-key
      secret:
        secretName: app-key
I was using a minimalistic Dockerfile like:
FROM scratch
ADD main /
EXPOSE 80
CMD ["/main"]
which kept my Go app in an indefinite "hanging" state when trying to connect to GCP Datastore. After LOTS of playing I figured out that the scratch Docker image is missing certain tools and libraries (such as CA certificates) which the Google Cloud library requires. Using this Dockerfile now works:
FROM golang:alpine
RUN apk add --no-cache ca-certificates
ADD main /
EXPOSE 80
CMD ["/main"]
It does not require me to provide the Google credentials environment variable. The library seems to detect where it is running (maybe from context.Background()?) and automatically uses the default service account that Google creates for you when you create your cluster on GKE.
