How to get logs from runtime packages_to_install in a Python kfp.v2.dsl.component running on Vertex AI? - google-cloud-vertex-ai

When running a kfp pipeline with custom components (Python function wrappers) that use a base_image plus packages_to_install on top of it, the component may fail silently without any descriptive error.
from kfp.v2.dsl import component

@component(
    base_image=f"{MY_BASE_IMAGE}",
    packages_to_install=MY_ADDITIONAL_PACKAGES_LIST,
)
def python_function():
    ...
The replica workerpool0-0 exited with a non-zero status of 1. Termination reason: Error.
As it fails to produce any logs from the actual function run, my guess is that it fails during the packages_to_install phase due to some broken dependencies between the base_image and the packages I try to install on top of it.
To pinpoint the exact problem, I would like to check the logs of this additional package installation (which I imagine is something like a Docker RUN that does a pip install), but I haven't found any logs from that step in Vertex.
Any ideas on how to get your hands on those logs? Thanks!
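One thing that may be worth trying is to pull every log entry Cloud Logging recorded for the underlying CustomJob, in case the pip output landed there but just isn't surfaced in the pipeline UI. A rough sketch with the Cloud Logging client; the ml_job resource type, project id, and job id below are assumptions and should be replaced with whatever the Vertex job detail page shows:

from google.cloud import logging as cloud_logging

# Hypothetical project id and CustomJob id -- take these from the Vertex AI
# "Custom jobs" page for the failed pipeline step.
client = cloud_logging.Client(project="my-project")
log_filter = (
    'resource.type="ml_job" '
    'AND resource.labels.job_id="1234567890123456789"'
)

# Print every log line attached to the job, oldest first, including anything
# the container wrote to stdout/stderr before the training code started.
for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.ASCENDING):
    print(entry.timestamp, entry.payload)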
Digging through the KFP component documentation, I haven't found any information about these logs.
My solution to the silent component failure would be to limit the use of this additional package installation step and bake as much as possible into the base image, but I would still like to be able to see the logs for the additional packages.
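In the meantime, one way to at least reproduce the failing step is to look at the compiled pipeline JSON: for each component it contains the container image and the generated command, which includes the pip install of packages_to_install, so the same command can be replayed locally inside the base image where its output is visible. A minimal sketch, assuming the pipeline was compiled to a file called pipeline.json (the file name is an assumption; the executor layout follows the kfp.v2 PipelineSpec format):

import json

# pipeline.json is whatever path was passed to
# kfp.v2.compiler.Compiler().compile(..., package_path="pipeline.json").
with open("pipeline.json") as f:
    spec = json.load(f)

# Each component ends up as an executor with the resolved image and command.
for name, executor in spec["deploymentSpec"]["executors"].items():
    container = executor["container"]
    print(f"--- {name} ---")
    print("image:  ", container["image"])
    print("command:", " ".join(container.get("command", [])))
    print("args:   ", " ".join(container.get("args", [])))
    # The command embeds the generated "pip install <packages>" bootstrap, so
    # running it manually with `docker run -it <image> ...` should reproduce
    # the install step with its full output on the terminal.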

Related

Vertex pipeline model training component stuck running forever because of metadata issue

I'm attempting to run a Vertex pipeline (custom model training) which I was able to run successfully in a different project. As far as I'm aware, all the pieces of infrastructure (service accounts, buckets, etc.) are identical.
The error appears in a gray box in the pipeline UI when I click on the model training component and reads the following:
Retryable error reported. System is retrying.
com.google.cloud.ai.platform.common.errors.AiPlatformException: code=ABORTED, message=Specified Execution `etag`: `1662555654045` does not match server `etag`: `1662555533339`, cause=null System is retrying.
I've looked in the Logs Explorer and found that the error logs are audit logs with the following associated fields:
protoPayload.methodName="google.cloud.aiplatform.internal.MetadataService.RefreshLineageSubgraph"
protoPayload.resourceName="projects/724306335858/locations/europe-west4/metadataStores/default"
This leads me to think that there's an issue with the Vertex metadata store or the way my pipeline is using it. The audit logs are automatic, though, so I'm not sure.
I've tried purging the metadata store as well as deleting it completely. I've also tried running a different model training pipeline that previously worked in another project, but with no luck.
[screenshot of the pipeline UI]
The retryable error you were getting was a temporary issue; it has now been resolved.
You can rerun the pipeline and it is not expected to enter the infinite retry loop again.

Google Cloud Dataflow jobs failing with error 'Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5...'

SDK: Apache Beam SDK for Go 0.5.0
We are running Apache Beam Go SDK jobs on Google Cloud Dataflow. They had been working fine until recently, when they intermittently stopped working (no changes were made to code or config). The error that occurs is:
Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5 for /var/opt/google/staged/worker: ..., want ; bad MD5 for /var/opt/google/staged/worker: ..., want ;
(Note: it seems as if the second hash value is missing from the error message.)
As best I can guess, there's something wrong with the worker: it seems to be comparing MD5 hashes of the worker binary and one of the values is missing, but I don't know exactly what it's comparing against.
Does anybody know what could be causing this issue?
The fix to this issue seems to have been to rebuild the worker_harness_container_image with the latest changes. I had tried this, but I didn't have the latest release when I built it locally. After I pulled the latest from the Beam repo and rebuilt the image (as per the notes here: https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md), rerunning it seemed to work again.
I'm seeing the same thing. If I look into Stackdriver Logging, I see this:
Handler for GET /v1.27/images/apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515/json returned error: No such image: apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515
However, I can pull the image just fine locally. Any ideas why Dataflow cannot pull it?

Talend ESB deployment on runtime

I'm on Talend ESB Runtime.
I encountered problems while starting ./trun: nothing appeared on the screen after starting it. The process is launched, but I can't get anything else...
Anyway, I tried to deploy a job, and there is something weird in tesb.log about an org.osgi.framework.BundleException.
karaf.log is OK.
Here is tesb.log:
tesb.log
karaf.log:
karaf.log
Log in the repository data:
timestamplog
I don't know how to investigate, because the logs are sparse and the JVM is the same between Talend ESB and the Runtime...
Can you help me please?
You only showed a small snippet of the log. From this I can already see that at least one bundle cannot be resolved, which means that this bundle cannot be used. In the snippet the bundle seems to be a user bundle, but I am pretty sure you have other such log messages showing that one of the main Karaf bundles cannot be loaded.
If you want to find the cause of the problem, look into these messages and search for non-optional packages that are not resolved. Usually this points to a missing bundle.
If you simply want to get your system running again, you can reset Karaf by using
./trun clean
Remember, though, that you will then have to reinstall all features.

Error starting container: API error (500) Hyperledger

I am using the Bluemix network to deploy and test my custom chaincode (link to the chaincode). I'm using the Swagger API to deploy, invoke, and query my chaincode. Deploy and invoke work fine, but when I try to query my chaincode, I keep getting the following error.
The following are the validating peer logs:
Is it a problem with my query code or a network issue? Any help is appreciated.
The error likely happened during the deploy phase (the logs just show the query). Since "deploy" is an asynchronous transaction that returns an ID (it just "submits" the transaction to be processed later), it cannot indicate whether the actual execution of the transaction will succeed or not. The "query" request, however, is synchronous and shows the failure.
Looking at the chaincode, the error is almost certainly due to the import and use of the "github.com/op/go-logging" package. As the fabric only copies the chaincode and does not pick up its dependencies, that package is not available at deploy time.
Note that the same code will work when placed under the "github.com/hyperledger/fabric" path, as "github.com/op/go-logging" is available there as a "vendor" package.
To test this, try commenting out the import statement and all logging from the code (make sure "go build" works locally first with the changes).

Veins LTE OMNeT++ Error

I have successfully run the example Veins LTE scenario on Ubuntu 14.04, SUMO 0.22, and OMNeT++ 4.6 using the command ./run (no debug).
The heterogeneous.rou.xml file has more nodes that are commented out. When I add some of these nodes back from this file, I get an error:
<!> Error in module (HeterogeneousToLTE) scenario.node[0].heterogeneousToLTE (id=58) at event #21075, t=5.6: IPvXAddressResolver: module `node[3]' not found.
Is it possible to run the scenario with a command or a change that ignores the nodes that are not found, before I continue with my own scenario map and route files?
Do you know how to solve this problem?
The error tells you that no node[3] exists. Make sure that this node is already in the simulation.
Please note that SimpleApp is a very basic application; you might have to extend your application so that it sends the messages to the correct car.
