Cloud Automation Manager on IBM Cloud Private - deployment not available and pods pending - ibm-cloud-private

I installed the CAM Helm release from the catalog, but the individual components of CAM are not being deployed. There doesn't seem to be any deployment available, and all of the CAM pods are pending.
[Screenshot of the CAM deployments on ICP]
With NFS installed, once I deploy CAM from the catalog, the PVs are bound to the PV claims from the start. However, the same problem persists: there are no available deployments for CAM, and the pods are stuck at either Init:0/1, Pending, or ContainerCreating without any change.
Edit:
When I checked the pods, the Pending state was due to insufficient memory, so I added another worker node and nothing is stuck in Pending anymore. However, I still have pods stuck at Init:0/1, ContainerCreating and ImagePullBackOff.
Here are some of the errors I see on the pods: Init:0/1 Error, ImagePullBackOff, ContainerCreating
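For reference, here is roughly how I have been inspecting the stuck pods (the "services" namespace is an assumption on my part; adjust it to wherever CAM was deployed):

    # List the CAM pods and their current states
    kubectl get pods -n services -o wide
    # The events at the bottom of "describe" explain the Pending,
    # ContainerCreating and ImagePullBackOff reasons
    kubectl describe pod <cam-pod-name> -n services
    # Recent namespace events, most recent last
    kubectl get events -n services --sort-by=.lastTimestamp | tail -20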

It could be one of a handful of issues, depending on how many times you have tried/retried the CAM deploy:
- delete the PVCs (see the sketch after this list)
- edit the PVs to remove any claim reference (or delete the PVs and recreate them)
- ensure the NFS exports are defined correctly (see the bottom of this page):
https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/cam_create_pv.html
- remove any prior files/data from the PV locations on disk
- delete the failed CAM chart deployment if it is still there
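A rough sketch of that cleanup from the CLI; all names here are placeholders, so check what actually exists on your cluster and on the NFS server before deleting anything:

    # Find and remove the CAM PVCs left behind by the failed deploy
    kubectl get pvc -n services
    kubectl delete pvc <cam-pvc-name> -n services
    # Clear the stale claim reference so a Released PV can bind again,
    # or simply delete the PV and recreate it from the YAML in the docs
    kubectl patch pv <cam-pv-name> -p '{"spec":{"claimRef":null}}'
    # Remove the failed chart release if it is still listed
    helm delete <cam-release-name> --purge --tls
    # On the NFS server, confirm the exports and wipe leftover data
    exportfs -v
    rm -rf /export/CAM_db/* /export/CAM_logs/*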
If you're still running into trouble, I suggest opening a support case so we can help you right away!
ibm.biz/icpsupport
Thanks.

This problem may be related to the CAM PVs not being created or not being bound. Could you please check that the 4 CAM PVs were created before deploying CAM?
From the ICP console > Platform > Storage, you should see the 4 CAM PVs listed there.
Please check https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/cam_create_pv.html for how to create the CAM PVs.
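The same check can be done from the CLI; this is just a sketch and assumes the PVs were created with "cam" in their names, as in the Knowledge Center examples:

    # The four CAM PVs should show up here, each with STATUS Bound
    # once the chart is deployed (Available before that)
    kubectl get pv | grep -i cam
    # "Released" usually means a stale claimRef from an earlier attempt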

Please review the options for offline installation of IBM Cloud Automation Manager. With an offline installation, it can take hours until the pod cam-iaas is running.
https://developer.ibm.com/cloudautomation/2018/10/18/ibm-cloud-automation-manager-3-1-delivers-improved-offline-installation-experience/

Related

Kubernetes HTTP Traffic goes to pod being updated

I have a pipeline with the following structure:
Step 1: kubernetes-deployment
Step 2: kubectl rollout restart deploy -n NAMESPACE
Step 3: http-calls to deployment A and B
In the same namespace there is a database pod, and the pods of deployments A and B are connected to this database.
The problem
The problem is caused by the rolling updates: when a rolling update is applied, Kubernetes starts new pods because the deployment was updated. The old pods are not terminated until the corresponding new pods are up, though.
As kubectl rollout restart deploy is a non-blocking call, it does not wait for the update to finish, and as far as I know kubectl has no built-in way to force such behaviour.
Since I execute some HTTP requests right after this call, I now have the problem that, when the update is not fast enough, the HTTP calls are received and answered by the old pods of deployments A and B. Shortly afterwards the old pods are terminated, as the new ones are up and running.
This leads to the problem that the effects of those HTTP requests are no longer visible: they were received by the old pods, which saved the corresponding data in a database located in the "old" database pod. As the database pod is restarted, that data is lost.
Note that I am not using a persistent volume in this case, because this comes from a nightly-build scenario: I want to restart those deployments every day, and the database state should only ever contain the data from the current day's build.
Do you see any way of solving this?
Including a simple wait step would probably work, but I am curious if there is a fancier solution.
Thanks in Advance!
kubectl rollout status deployment <deploymentname> together with startupProbe and livenessProbe solved my problem.
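For anyone else landing here, a minimal sketch of how that can look as the pipeline step; the deployment names and namespace are placeholders:

    # Restart, then block until the new ReplicaSets are fully rolled out
    # before the HTTP calls in step 3 are fired
    kubectl rollout restart deploy deployment-a deployment-b -n NAMESPACE
    kubectl rollout status deployment/deployment-a -n NAMESPACE --timeout=300s
    kubectl rollout status deployment/deployment-b -n NAMESPACE --timeout=300s
    # rollout status exits non-zero on timeout, failing the pipeline step
    # instead of letting requests hit the old pods. With startup/readiness
    # probes in place, "rolled out" also means the pods can actually serve.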

Getting production readiness error for IoT Edge on Raspberry Pi

I installed IoT Edge on my Raspberry Pi two weeks ago with no problems by following the steps here: https://learn.microsoft.com/en-us/azure/iot-edge/how-to-install-iot-edge-linux
This week I turned my Raspberry Pi on and I am now getting the following error when I run iotedge check. I am also getting a 406 error when I check its status in the IoT Hub.
Error: "production readiness: Edge Agent's storage directory is persisted on the host filesystem-Error
Could not check current state of edgeAgent container
production readiness: Edge Hub's storage directory is persisted on the host filesystem-Error
Could not check current state of edgeHub container"
When I run it with --verbose, I get:
"production readiness: Edge Agent's storage directory is persisted on the host filesystem-Error
The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
Data might be lost if the module is deleted or updated.
Please see https://aka.ms/iotedge-storage-host for best practices"
If anyone can help me out, I'd really appreciate it.
Updated errors after I linked the module storage to the device storage:
[screenshot of the updated errors]
Would you share the edgeAgent and edgeHub module twin properties defined in the Azure portal? Remember to remove any secrets/passwords/connection strings before sharing.
The production readiness checks are warnings, and you can ignore them if you are just running tests in an experimental scenario. They will not affect the outcome of the tutorial you followed.
When you're ready to take your IoT Edge solution from development into production, make sure it is configured for ongoing performance as described here: Prepare to deploy your IoT Edge solution in production
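If you do want to clear the two storage warnings, here is a rough sketch of the host-side part, based on the aka.ms/iotedge-storage-host guidance; the paths are examples, and the deployment-manifest changes (a HostConfig bind plus the storageFolder environment variable for edgeAgent and edgeHub) still have to be made as described in that doc:

    # Create directories on the Pi for edgeAgent / edgeHub to persist state
    # (example paths -- pick whatever suits your device)
    sudo mkdir -p /srv/iotedge/edgeAgent /srv/iotedge/edgeHub
    # ownership/permissions may need adjusting for the module user;
    # follow the linked doc for the exact guidance
    # After updating the deployment manifest to bind these directories into
    # the containers, re-run the check to confirm the warnings are gone
    sudo iotedge check --verbose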

Cloud Automation Manager Pods in CrashLoopBackOff

I'm having an issue where some of my pods are in CrashLoopBackOff when I try to deploy CAM from the catalog. I also followed the instructions in the IBM documentation to clear the data from the PVs (by doing rm -Rf /export/CAM_db/*) and to purge the previous installations of CAM.
Here are the pods that are in CrashLoopBackOff:
[screenshot of the CAM pods]
Here's the specific error when I describe the pod:
[screenshot of the cam-mongo pod description]
Ro-
It is almost always the case that if the cam-mongo pod does not come up properly, the issue is with the PV being unable to mount/read/access the actual disk location, or with the data itself on the PV.
Since your pod events indicate that the container image already exists, and it is scoped to the store, it seems like you have already tried to install CAM before and it is using the CE version from the Docker store, correct?
If a prior deploy did not go well, do clean up the disk locations as per the doc,
https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/cam_uninstalling.html
but as you showed, I can see you have already tried cleaning CAM_db, so do the same for the CAM_logs, CAM_bpd and CAM_terraform locations (a quick sketch follows).
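A quick sketch of that cleanup on the NFS server; the directory names are taken from the note above and from what you showed, so adjust them to your actual export paths:

    # Wipe leftover data from every CAM PV location, not just CAM_db
    rm -rf /export/CAM_logs/*
    rm -rf /export/CAM_bpd/*
    rm -rf /export/CAM_terraform/*
    # Re-export and confirm the shares are still published as expected
    exportfs -ra
    exportfs -v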
Make a note of our install troubleshooting section, as it describes a few scenarios in which cam-mongo can be impacted:
https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/ts_cam_install.html
At the bottom of the PV creation topic, we provide some guidance on the NFS mount options that work best; please review it:
https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/cam_create_pv.html
Hope this helps you make some forward progress!
The postStart error you can effectively ignore; it means the mongo container probably failed to start, so it kills the post-start script.
This issue is usually due to an NFS configuration problem.
I would recommend trying the troubleshooting steps here, in the section "cam-mongo pod is in CrashLoopBackOff":
https://www.ibm.com/support/knowledgecenter/SS2L37_3.1.0.0/ts_cam_install.html
If it's NFS, it's typically things like (see the example export below):
- no_root_squash is missing on the base directory
- fsid=0 needs to be removed from the base directory for that setup
- folder permissions
Note: I have seen another customer hit this issue where the problem was caused by NFS: there was already a .snapshot file there, and they had to remove it first.
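For illustration, a sketch of what the export setup might look like on the NFS server; the path and client subnet are placeholders, and the exact options should follow the cam_create_pv topic above:

    # Example /etc/exports entry for the CAM PV base directory:
    #   /export  192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
    # i.e. no_root_squash present and no fsid=0 on the base directory
    exportfs -ra            # re-read /etc/exports
    exportfs -v             # confirm the options actually in effect
    showmount -e localhost  # verify the export is visible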

Chef for large-scale web deployment on Windows

I am trying to do MSI web deployments with Chef. I have about 400 web servers with the same configuration. We will do the deployment in two slots of 200 servers each.
I will follow the steps below for a new release (a rough sketch of these steps follows the list):
1) Increase the cookbook version.
2) Upload the cookbook to the Chef server.
3) Update the cookbook version in the role and run list.
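Roughly, those three steps look like this with knife (a sketch; the cookbook, environment and role names are placeholders):

    # 1) bump the version in the cookbook's metadata.rb, then
    # 2) upload it to the Chef server
    knife cookbook upload my_web_cookbook
    # 3) pin the new version for the slot being released, either in the
    #    environment or in the role, whichever you use for pinning
    knife environment edit slot1_prod
    knife role edit web_slot1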
The cookbook will do a lot of steps, like installing 7 MSIs, updating IIS settings, updating the web.config file, and adding registry entries. Once the deployment is done we need to notify the testing team so that they can start testing. My question is: how can I ensure the deployment completed successfully on all the machines? How can I find out if one MSI was not installed on one machine, or one web.config file was not updated properly?
My understanding is that chef-client runs every 30 minutes by default, so I have to wait up to 30 minutes for the deployment to complete. Is there any other way to push the change (I can't use Push Jobs, since Chef removed Push Jobs support from Chef High Availability servers), such as triggering chef-client with knife from the workstation?
It would be great if anyone who uses Chef for large-scale Windows deployments could share their experience.
Thanks in advance.
I personally use Rundeck to trigger on-demand Chef runs.
Based on your description, I would use two prod environments, one for each group, and bump the cookbook version constraint for each group separately.
For reporting at this scale, consider buying a license for Chef Manage and Chef Reporting so you get a complete overview; the next option is to use a handler to report the run status and send a mail if there was an error during the run.
Nothing here is specific to Windows, so really you are asking how to use Chef in a high-churn environment. I would highly recommend checking out the new Policyfile workflow; we've had a lot of success with it, though it has some sharp limitations. I've got a guide up at https://yolover.poise.io/. Another option on the cookbook/data release side is to move a lot of your tunables (e.g. versions of things to deploy) out of the cookbook and into a little web service somewhere, then have your recipe code read from that to get its tuning data.
As for the push vs. pull question, most people end up with a hybrid. As Tensibai mentioned, Rundeck is a popular push-based option. Usually you still leave background interval runs on a longer cycle time (maybe 1 or 2 hours) to catch config drift, and use the push system for more specific deploy tasks. Beyond Rundeck you can also check out Fabric, Capistrano, MCollective, and SaltStack (you can use its remote execution layer without the CM stuff). Chef also has its own Push Jobs project, but I think I can safely say you should avoid it at this point; it never got enough community momentum to really go anywhere.
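For a quick push-style run without extra tooling, a sketch using plain knife; the search query, user, and transport details are assumptions for your environment (on Windows nodes, knife winrm from the knife-windows plugin plays the role of knife ssh):

    # Trigger an immediate chef-client run on the first slot of web servers
    knife ssh 'role:web_slot1' 'sudo chef-client' -x deployuser
    # Then check which nodes have not converged recently; anything stale
    # here most likely failed its run and needs a closer look
    knife status 'role:web_slot1' --run-list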

Autoscaling for azure cloud service not working

When the new autoscaling functionality became available in Windows Azure a few weeks ago, I enabled it on my service and everything was great.
Then I deleted the deployment and deployed it again. Now I get an error saying "Autoscale failed for [role name]", and I can only scale the service manually.
I also tried deleting the service altogether and recreating it from scratch, with no improvement. This is the only service I have this problem with; if I deploy the same solution to another service, it works.
Does anyone know how to get around this?
