Vertex pipeline model training component stuck running forever because of metadata issue - google-cloud-vertex-ai

I'm attempting to run a Vertex pipeline (custom model training) which I was able to run successfully in a different project. As far as I'm aware, all the pieces of infrastructure (service accounts, buckets, etc.) are identical.
The error appears in a gray box in the pipeline UI when I click on the model training component and reads the following:
Retryable error reported. System is retrying.
com.google.cloud.ai.platform.common.errors.AiPlatformException: code=ABORTED, message=Specified Execution `etag`: `1662555654045` does not match server `etag`: `1662555533339`, cause=null System is retrying.
I've looked in the Logs Explorer and found that the error logs are audit logs that have the following fields associated with them:
protoPayload.methodName="google.cloud.aiplatform.internal.MetadataService.RefreshLineageSubgraph"
protoPayload.resourceName="projects/724306335858/locations/europe-west4/metadataStores/default"
This leads me to think that there's an issue with the Vertex ML Metadata store or with the way my pipeline is using it. The audit logs are generated automatically though, so I'm not sure.
I've tried purging the metadata store as well as deleting it completely. I've also tried running a different model training pipeline that previously worked in another project, but with no luck.
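For reference, a programmatic purge of the default metadata store can be issued with the aiplatform_v1 MetadataServiceClient; this is a minimal sketch, where the filter is an assumption and the project/location values come from the log entry above:

from google.cloud import aiplatform_v1

# The regional endpoint must match the metadata store's location.
client = aiplatform_v1.MetadataServiceClient(
    client_options={"api_endpoint": "europe-west4-aiplatform.googleapis.com"}
)

parent = "projects/724306335858/locations/europe-west4/metadataStores/default"

# force=False only reports what would be deleted; set force=True to actually purge.
# The filter below is an assumption; adjust it to match the executions to purge.
operation = client.purge_executions(
    request=aiplatform_v1.PurgeExecutionsRequest(
        parent=parent,
        filter='create_time<"2099-01-01T00:00:00Z"',
        force=False,
    )
)
print(operation.result())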
[screenshot of the pipeline UI]

The retryable error you were getting was caused by a temporary issue, which has since been resolved.
You should now be able to rerun the pipeline, and it is not expected to enter the infinite retry loop.
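Rerunning the compiled pipeline with the Vertex AI SDK looks roughly like this; a minimal sketch where the project, bucket, display name, and template path are placeholders (only the location comes from the question):

from google.cloud import aiplatform

# Placeholder project and bucket values; the location matches the one in the question.
aiplatform.init(
    project="my-project",
    location="europe-west4",
    staging_bucket="gs://my-pipeline-root",
)

job = aiplatform.PipelineJob(
    display_name="custom-model-training",                  # placeholder
    template_path="gs://my-pipeline-root/pipeline.json",   # compiled pipeline spec (placeholder path)
    pipeline_root="gs://my-pipeline-root",
    enable_caching=False,  # optional: force components to re-run instead of reusing cached results
)

# Submits the run asynchronously; uses the default service account unless one is passed explicitly.
job.run(sync=False)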

Related

Terraform and OCI : "The existing Db System with ID <OCID> has a conflicting state of UPDATING" when creating multiple databases

I am trying to create 30 databases (oci_database_database resource) under 5 existing db_homes. All of these resources are under a single DB System.
When applying my code, a first database is successfully created; then, when Terraform attempts to create the second one, I get the following error message: "Error: Service error:IncorrectState. The existing Db System with ID has a conflicting state of UPDATING", which causes the execution to stop.
If I re-apply my code, the second database is created, and then I get the same error when Terraform attempts to create the third one.
I am assuming I get this message because Terraform starts creating the next database as soon as the previous one is created, while the DB System's status is not up to date yet (still 'UPDATING' instead of 'AVAILABLE').
A good way for the OCI provider to avoid this issue would be to consider a database creation as completed when the creation is indeed completed AND the associated db home and db system's status are back to 'AVAILABLE'.
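Outside of Terraform, the "wait until the DB System is back to AVAILABLE" behaviour described above can be approximated with the OCI Python SDK; a sketch assuming default ~/.oci/config credentials and a placeholder OCID:

import oci

config = oci.config.from_file()  # default ~/.oci/config profile
db_client = oci.database.DatabaseClient(config)

db_system_id = "ocid1.dbsystem.oc1..example"  # placeholder OCID

# Poll the DB System until its lifecycle_state returns to AVAILABLE before
# creating the next database (this mirrors the behaviour the provider would ideally have).
oci.wait_until(
    db_client,
    db_client.get_db_system(db_system_id),
    "lifecycle_state",
    "AVAILABLE",
    max_wait_seconds=1800,
)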
Any suggestions on how to address the issue I am encountering?
Feel free to ask if you need any additional information.
Thank you.
As mentioned above, it looks like you have opened an issue about this on GitHub. What you are experiencing should not happen, as Terraform should retry after seeing the error. As per your GitHub post, the person helping you needs your logs with timestamps so they can troubleshoot further. At this stage I would recommend following up there and sharing the requested information.

Why does the NiFi flow on one machine occasionally fail to sync with another machine?

I get the ERROR log below and have worked around it by copying the same flow from one node to the others. However, many suggestions tell me to delete the mismatched flows so that the flow is regenerated automatically. Sometimes this works, but it doesn't right now. More importantly, does the log show the reason why the flows differ? How can I avoid this error? TIA
NIFI-APP LOG:
[screenshot of nifi-app.log]

Google Cloud Dataflow jobs failing with error 'Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5...'

SDK: Apache Beam SDK for Go 0.5.0
We are running Apache Beam Go SDK jobs in Google Cloud Dataflow. They had been working fine until recently, when they intermittently stopped working (no changes were made to code or configuration). The error that occurs is:
Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5 for /var/opt/google/staged/worker: ..., want ; bad MD5 for /var/opt/google/staged/worker: ..., want ;
(Note: it seems as if a second hash value is missing from the error message.)
As best I can guess, there's something wrong with the worker: it seems to be comparing MD5 hashes of the worker binary, and one of the values is missing. I don't know exactly what it's comparing against, though.
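For what it's worth, the staged worker binary lives in the job's staging bucket, and GCS records an MD5 for every object; a hedged sketch of comparing the local file against that metadata (the bucket and object names are placeholders, the local path is the one from the error, and whether Dataflow checks exactly this value is an assumption):

import base64
import hashlib

from google.cloud import storage

# Placeholder bucket/object for the staged worker binary.
blob = storage.Client().bucket("my-staging-bucket").get_blob("staging/worker")
expected_md5 = blob.md5_hash  # base64-encoded MD5 stored by GCS

# Path from the error message; this file only exists on the Dataflow worker VM.
with open("/var/opt/google/staged/worker", "rb") as f:
    actual_md5 = base64.b64encode(hashlib.md5(f.read()).digest()).decode()

print("want:", expected_md5, "got:", actual_md5)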
Does anybody know what could be causing this issue?
The fix for this issue seems to have been to rebuild the worker_harness_container_image with the latest changes. I had tried this, but I didn't have the latest release when I built it locally. After I pulled the latest from the Beam repo, rebuilt the image (as per the notes here: https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md), and reran the job, it seemed to work again.
I'm seeing the same thing. If I look in the Stackdriver logs I see this:
Handler for GET /v1.27/images/apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515/json returned error: No such image: apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515
However, I can pull the image just fine locally. Any ideas why Dataflow cannot pull it?

Step Failure not reported by Composed Task Runner or reflected in Spring Cloud Dataflow Tables

Currently we are using Spring Cloud Dataflow to run a sequence of apps we have created, based on a definition. Each of the apps is a Spring Batch job with individual steps. The issue we are having is that when one of the steps inside an app's batch job fails, the failure is reflected as expected in the step_execution, job_execution, and task_execution tables in the SCDF database. However, we are not able to rerun a failed SCDF job from the top SCDF level, because the row in the step_execution table for SCDF's step representing the overall app never propagates to FAILED in the status column; it is always COMPLETED no matter what happens.
Below I have included a picture which illustrates this. test-simple8-test-app is the app we have created, while check-step, sleep-step, and should-error-step are steps inside that app's job. You can see that should-error-step has FAILED for both ExitCode and Status, while the entry for the app itself has COMPLETED for Status and FAILED for ExitCode.
Relevant Table
We have tried altering what we report in the task_execution table, since we saw that CTR looks for certain fields there, but it still does not seem to affect the Status column in step_execution. If we manually change that entry in the database to FAILED, everything proceeds as we would expect and as is normal for Spring Batch: it resumes the job from that app and re-executes it.
Is there a good way to resolve this, or is it a problem with the way we are approaching it?
Edit: Added Flow Diagram for better clarity

Mesos framework stays inactive due to "Authentication failed: EOF"

I'm currently trying to deploy Eremetic (version 0.28.0) on top of Marathon using the configuration provided as an example. I actually have been able to deploy it once, but suddenly, after trying to redeploy it, the framework stays inactive.
By inspecting the logs I noticed a constant attempt to connect to some service that apparently never succeeds because of some authentication problem.
2017/08/14 12:30:45 Connected to [REDACTED_MESOS_MASTER_ADDRESS]
2017/08/14 12:30:45 Authentication failed: EOF
It looks like the service returning the error is ZooKeeper, and more precisely the error can be traced back to this line in the Go ZooKeeper library. ZooKeeper itself, however, seems to be working: I've queried it directly with zkCli and run a small Spark job (where the Mesos master is given as a zk:// URL), and everything works.
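A quick way to reproduce that direct check from code is the kazoo client; a sketch assuming the default /mesos znode from the zk:// URL and a placeholder host:

from kazoo.client import KazooClient

# Placeholder ZooKeeper address; use the same hosts as in the zk:// master URL.
zk = KazooClient(hosts="zk-host:2181")
zk.start()

# The Mesos masters register under the znode from the zk:// URL (commonly /mesos);
# listing its children confirms ZooKeeper is answering queries.
print(zk.get_children("/mesos"))

zk.stop()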
Unfortunately I'm not able to diagnose the problem further. What could it be?
It turned out to be a configuration problem: the master URL was simply wrong, and that is how the error surfaced.
