Cloud Automation Manager Pods on CrashLoopBackOff - ibm-cloud-private

I'm having an issue where some of my pods are in CrashLoopBackOff when I try to deploy CAM from the Catalog. I also followed the instructions in the IBM documentation to clear the data from the PVs (by running rm -Rf /export/CAM_db/*) and purge the previous installations of CAM.
Here are the pods that are in CrashLoopBackOff:
[Screenshot: CAM pods in CrashLoopBackOff]
Here's the specific error when I describe the pod:
[Screenshot: describe output for the MongoDB pod]

Ro-
If the cam-mongo pod does not come up properly, it is almost always because the PV cannot mount, read, or access the underlying disk location, or because of the data already sitting on the PV.
Since your pod events indicate the container image already exists and is scoped to the store, it seems you have tried installing CAM before and it is using the CE version from the Docker Store, correct?
If a prior deploy did not go well, clean up the disk locations as described in the doc:
https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/cam_uninstalling.html
but, as you showed, you have already cleaned CAM_db, so do the same for the CAM_logs, CAM_bpd, and CAM_terraform locations.
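If those directories follow the same /export layout you used for CAM_db (the exact paths here are an assumption, so adjust them to your NFS server), the cleanup would look roughly like this:

    # Clear leftover CAM data on the NFS server; paths are assumptions based on /export/CAM_db above.
    sudo rm -Rf /export/CAM_db/*
    sudo rm -Rf /export/CAM_logs/*
    sudo rm -Rf /export/CAM_bpd/*
    sudo rm -Rf /export/CAM_terraform/*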
Make a note of our install troubleshooting section, as it describes a few scenarios in which cam-mongo can be impacted:
https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/ts_cam_install.html
At the bottom of the PV creation topic we provide some guidance on the NFS mount options that work best; please review it:
https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/cam_create_pv.html
Hope this helps you make some forward progress!

You can effectively ignore the postStart error; it means the mongo container probably failed to start, so the post-start script was killed.
This issue is usually due to an NFS configuration problem.
I would recommend trying the troubleshooting steps in the section titled "cam-mongo pod is in CrashLoopBackOff" here:
https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/ts_cam_install.html
If it's NFS, typically it's things like:
- no_root_squash missing on the base directory
- fsid=0 needing to be removed from the base directory for that setup
- folder permissions
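As a rough illustration of those points (the export path and client range below are assumptions, not values from your environment), an NFS exports entry with no_root_squash and without fsid=0 might look like this:

    # Hypothetical /etc/exports entry on the NFS server -- adjust path and client range:
    #   /export  192.168.1.0/24(rw,sync,no_root_squash,no_subtree_check)
    # After editing, re-export and verify:
    sudo exportfs -ra
    sudo exportfs -v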
Note: I have seen another customer hit this issue where the problem turned out to be on the NFS side: a .snapshot file was already present there, and they had to remove it first.

Related

Kubernetes HTTP Traffic goes to pod being updated

I have a pipeline with the following structure:
Step 1: kubernetes-deployment
Step 2: kubectl rollout restart deploy -n NAMESPACE
Step 3: http-calls to deployment A and B
In the same namespace, there is a database pod, and pods A and B are connected to this database.
The problem
The problem is caused by rolling updates: when a rolling update is applied, Kubernetes starts new pods because the deployment was updated. The old pod is not terminated until the corresponding new pod is up, though.
Since kubectl rollout restart deploy is a non-blocking call, it will not wait for the update to finish, and as far as I know there is no built-in way to force kubectl to wait.
Because I execute some HTTP requests after this call, I now have the problem that sometimes, when the update is not fast enough, the HTTP calls are received and answered by the old pods of deployments A and B. Shortly after that, the old pods are terminated, as the new ones are up and running.
This means the effects of those HTTP requests are no longer visible: they were received by the old pods, which saved the corresponding data in a database located in the "old" database pod. Since the database pod is restarted, that data is lost.
Note that I am not using a persistent volume in this case, as this comes from a nightly-build scenario: I want to restart those deployments every day, and the database state should always contain only the data from the current day's build.
Do you see any way of solving this?
Including a simple wait step would probably work, but I am curious if there is a fancier solution.
Thanks in Advance!
Using kubectl rollout status deployment <deploymentname>, together with a startupProbe and a livenessProbe, solved my problem.
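A minimal sketch of that approach (the deployment names and namespace are placeholders):

    # Restart the deployments, then block until the rollouts complete
    # before firing the HTTP calls.
    kubectl rollout restart deployment deployment-a deployment-b -n NAMESPACE
    kubectl rollout status deployment deployment-a -n NAMESPACE --timeout=180s
    kubectl rollout status deployment deployment-b -n NAMESPACE --timeout=180s
    # Only now run the HTTP calls against deployments A and B.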

Getting a production readiness error for IoT Edge on Raspberry Pi

I installed IoT Edge on my Raspberry Pi two weeks ago with no problems by following the steps here: https://learn.microsoft.com/en-us/azure/iot-edge/how-to-install-iot-edge-linux
This week I turned my Raspberry Pi on and am now getting the following error when I run iotedge check. I am also getting a 406 error when I check its status in the IoT Hub.
Error: "production readiness: Edge Agent's storage directory is persisted on the host filesystem-Error
Could not check current state of edgeAgent container
production readiness: Edge Hub's storage directory is persisted on the host filesystem-Error
Could not check current state of edgeHub container"
When I run it with --verbose, I get:
"production readiness: Edge Agent's storage directory is persisted on the host filesystem-Error
The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
Data might be lost if the module is deleted or updated.
Please see https://aka.ms/iotedge-storage-host for best practices"
If anyone can help me out, I'd really appreciate it.
Updated errors after I linked the module storage to the device storage:
[Screenshot: updated iotedge check output]
Would you share the edgeAgent and edgeHub twin properties defined in the Azure portal? Remember to remove any secrets, passwords, or connection strings before sharing.
The production readiness entries are warnings, and you can ignore them if you are just testing an experimental scenario. These warnings will not affect the outcome of the tutorial you followed.
When you're ready to take your IoT Edge solution from development into production, make sure it's configured for ongoing performance as described here: Prepare to deploy your IoT Edge solution in production
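If you do want to clear those warnings, the linked doc covers binding the edgeAgent and edgeHub storage folders to the host. A rough sketch of the host-side preparation only (the /srv/iotedge path and the UID are assumptions; the matching createOptions/storageFolder settings in your deployment manifest are described in the doc):

    # Create host directories for edgeAgent and edgeHub storage (paths are an assumption).
    sudo mkdir -p /srv/iotedge/edgeAgent /srv/iotedge/edgeHub
    # Give the module containers access; the default images run as a non-root user
    # (commonly UID 1000) -- verify the exact UID against the linked documentation.
    sudo chown -R 1000:1000 /srv/iotedge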

Cloud Automation Manager on IBM Cloud Private - deployment not available and pods pending

I installed the CAM Helm release from the catalog; however, the individual components of CAM are not being deployed. There don't seem to be any deployments available, and all of CAM's pods are pending.
[Screenshot: CAM deployments on ICP]
With NFS installed, once I deploy CAM from the catalog, the PVs are now bound to the PV claims from the start. However, the same problem persists: there are no available deployments for CAM, and the pods are stuck at Init:0/1, Pending, or ContainerCreating without any change.
Edit:
When I checked the pods, the Pending error was due to insufficient memory, so I added another worker node and I no longer have any pending deployments. However, I still have the issue of pods being stuck at Init:0/1, ContainerCreating, and ImagePullBackOff.
Here are some of the errors: [Screenshots: Init:0/1 error, ImagePullBackOff, ContainerCreating]
This could be one of a handful of issues, depending on how many retries of the CAM deploy you have attempted. To reset things:
- delete the PVCs
- edit the PVs to remove any PV claim (or delete the PVs and recreate them)
- ensure the NFS exports are defined correctly (see the bottom of the page):
https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/cam_create_pv.html
- remove any prior files/data from the PV locations on disk
- delete the failed CAM chart deploy if it is still there
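At the command line, that cleanup might look roughly like the sketch below (the PVC, PV, and release names plus the namespace are assumptions, and ICP's helm CLI usually needs --tls):

    # Hedged sketch of the cleanup steps above; adjust names and namespace to your install.
    kubectl delete pvc cam-mongo-pvc cam-logs-pvc cam-terraform-pvc cam-bpd-appdata-pvc -n services
    kubectl patch pv cam-mongo-pv -p '{"spec":{"claimRef":null}}'   # release a stuck PV, or delete and recreate it
    helm delete cam --purge --tls                                   # remove the failed chart deploy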
If you're facing more trouble, I suggest opening a support case so we can help you right away:
ibm.biz/icpsupport
thx.
This problem may be related to the CAM PVs not being created or not being bound. Could you please check that the 4 CAM PVs were created before deploying CAM?
From the ICP console > Platform > Storage, you should see the 4 CAM PVs.
Please check https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/cam_create_pv.html for how to create the CAM PVs.
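If you prefer the CLI, a quick check could look like this (the grep pattern is an assumption about the PV names):

    # The 4 CAM PVs should show STATUS "Available" before the deploy,
    # or "Bound" once the CAM PVCs have claimed them.
    kubectl get pv | grep -i cam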
Please review the options for offline installation of IBM Cloud Automation Manager. With an offline installation, it can take hours until the pod cam-iaas is running.
https://developer.ibm.com/cloudautomation/2018/10/18/ibm-cloud-automation-manager-3-1-delivers-improved-offline-installation-experience/

Undeploying Business Network

Using Hyperledger Composer 0.19.1, I can't find a way to undeploy my business network. I don't necessarily want to upgrade to a newer version each time, but rather replace the deployed one with a fix in the JS code, for instance. Is there any replacement for the undeploy command that existed before?
There is no replacement for the old undeploy command, and in fact it did not really undeploy anything; it merely hid the old network.
Be aware that every time you upgrade a network it creates a new Docker image and container, so you may want to tidy these up periodically. (You could also delete the BNA from the peer servers, but these are very small in comparison to the Docker images.)
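As a rough example of that periodic tidy-up (the dev- filter matches the usual Fabric dev-mode chaincode naming, but verify it against your own docker ps output before deleting anything):

    # List the chaincode containers/images created by network upgrades.
    docker ps -a --filter "name=dev-"
    docker images "dev-*"
    # Remove the ones you no longer need (destructive -- double-check the lists first).
    docker rm -f $(docker ps -aq --filter "name=dev-")
    docker rmi $(docker images -q "dev-*")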
It might not help your situation, but if you are rapidly developing and iterating, you could try this in the online Playground or a local Playground with the Web profile; this is fast and does not create any new images or containers.

best way to bundle update on server while booting

I have an AMI that is configured with my production code. I am using Nginx + Unicorn as the server setup.
The problem I am facing is that whenever traffic goes up, I need to boot an instance, log in to it, do a git pull and bundle update, and also precompile the assets, which is time consuming. I want to avoid all of this.
Now I want a script/process that automates the whole deployment (git pull, bundle update, and asset precompilation) as soon as I boot a new instance from this AMI.
Is there a good way to get this done? Any help would be appreciated.
You can place your commands in /etc/rc.local (commands in this file are executed when the server boots).
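A minimal sketch of such a boot script (the app path, branch, deploy user, and restart commands are assumptions about your setup):

    #!/bin/bash
    # Hypothetical boot-time deploy for a Rails app behind Nginx + Unicorn.
    APP_DIR=/var/www/application
    sudo -u deploy bash -lc "
      cd $APP_DIR &&
      git pull origin master &&
      bundle install --deployment --without development test &&
      RAILS_ENV=production bundle exec rake assets:precompile
    "
    service unicorn restart
    service nginx reload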
But the best way is to use Capistrano. You need to add require "capistrano/bundler" to your deploy.rb file, and bundler will be run automatically during deploys. For more information you can read this article: https://semaphoreapp.com/blog/2013/11/26/capistrano-3-upgrade-guide.html
An alternative approach is to deploy your app to a separate EBS volume (you can still mount this inside /var/www/application or wherever it currently is).
After deploying, you create an EBS snapshot of this volume. When you create a new instance, you tell EC2 to create a new volume for your instance from the snapshot, so the instance starts with the latest gems/code already installed (I find bundle install can take several minutes). All your startup script needs to do is mount the volume (or, if you added it to the fstab when you made the AMI, you don't even need to do that). I much prefer scaling operations like this to have no external dependencies (for example, what would you do if GitHub or rubygems.org had an outage just when you needed to deploy?).
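A rough sketch of that flow with the AWS CLI (all IDs, the availability zone, device name, and mount point are placeholders):

    # After a deploy, snapshot the app volume.
    aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "app code and gems"
    # When booting a new instance, create a volume from the snapshot in the
    # instance's availability zone, attach it, and mount it.
    aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a
    aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0123456789abcdef0 --device /dev/xvdf
    sudo mount /dev/xvdf /var/www/application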
You can even take this a step further by using Amazon's Auto Scaling service. In a nutshell, you create a launch configuration where you specify the AMI, instance type, volume snapshots, etc. Then you control the group size manually (through the web console or the API), according to a fixed schedule, or based on CloudWatch metrics. Amazon will create or destroy instances as needed, using the information in your launch configuration.
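For the Auto Scaling route, the setup might look roughly like this (the names, AMI ID, instance type, key pair, and subnet are placeholders):

    # Launch configuration built from the pre-baked AMI.
    aws autoscaling create-launch-configuration \
      --launch-configuration-name app-lc \
      --image-id ami-0123456789abcdef0 \
      --instance-type t3.small \
      --key-name my-key
    # Auto Scaling group that Amazon grows and shrinks for you.
    aws autoscaling create-auto-scaling-group \
      --auto-scaling-group-name app-asg \
      --launch-configuration-name app-lc \
      --min-size 1 --max-size 4 \
      --vpc-zone-identifier "subnet-0123456789abcdef0"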
