Vertex API - Failed to build Pipeline Internal Error - google-cloud-vertex-ai

I built a dataset using 30 books in .txt format, loaded them up, and started training, but the pipeline build kept failing with an "Internal Error". Does anyone have an idea of the root cause of this error? I am using the full Vertex AI AutoML preconfigured models, as this is a quick demo. Please advise.
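For reference, the workflow being described corresponds roughly to the sketch below using the Vertex AI Python SDK; the bucket paths, display names, and the assumption that this is single-label text classification are illustrative only, not details from the question.

    # Rough sketch of the AutoML text workflow described above.
    # Paths, display names, and the classification task type are assumptions.
    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="us-central1")

    # Import the .txt books via a JSONL/CSV import file staged in GCS.
    dataset = aiplatform.TextDataset.create(
        display_name="books-demo",
        gcs_source="gs://my-bucket/books/import.jsonl",
        import_schema_uri=aiplatform.schema.dataset.ioformat.text.single_label_classification,
    )

    # Preconfigured AutoML text training, as mentioned in the question.
    job = aiplatform.AutoMLTextTrainingJob(
        display_name="books-demo-training",
        prediction_type="classification",
        multi_label=False,
    )

    model = job.run(dataset=dataset, model_display_name="books-demo-model")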

Related

Vertex pipeline model training component stuck running forever because of metadata issue

I'm attempting to run a Vertex pipeline (custom model training) which I was able to run successfully in a different project. As far as I'm aware, all the pieces of infrastructure (service accounts, buckets, etc.) are identical.
The error appears in a gray box in the pipeline UI when I click on the model training component and reads the following:
Retryable error reported. System is retrying.
com.google.cloud.ai.platform.common.errors.AiPlatformException: code=ABORTED, message=Specified Execution `etag`: `1662555654045` does not match server `etag`: `1662555533339`, cause=null System is retrying.
I've looked into the Log Explorer and found that the error logs are audit logs with the following associated tags:
protoPayload.methodName="google.cloud.aiplatform.internal.MetadataService.RefreshLineageSubgraph"
protoPayload.resourceName="projects/724306335858/locations/europe-west4/metadataStores/default"
This leads me to think that there's an issue with the Vertex metadata store or with the way my pipeline is using it. The audit logs are automatic, though, so I'm not sure.
I've tried purging the metadata store as well as deleting it completely. I've also tried running a different model training pipeline that previously worked in another project, but with no luck.
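(For context, purging executions from the default store can be done along these lines with the low-level metadata client; the endpoint, filter, and dry-run flag below are illustrative, not necessarily what was used.)

    # Illustrative sketch: purge executions from the default metadata store.
    from google.cloud import aiplatform_v1

    client = aiplatform_v1.MetadataServiceClient(
        client_options={"api_endpoint": "europe-west4-aiplatform.googleapis.com"}
    )

    operation = client.purge_executions(
        request=aiplatform_v1.PurgeExecutionsRequest(
            parent="projects/724306335858/locations/europe-west4/metadataStores/default",
            filter='create_time<"2022-09-07T00:00:00Z"',  # example filter
            force=False,  # dry run: report what would be purged without deleting
        )
    )
    print(operation.result())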
The retryable error you were getting was caused by a temporary issue, which has now been resolved.
You can rerun the pipeline, and it is not expected to enter the infinite retry loop.
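A rerun can be resubmitted with the Vertex AI SDK, for example as below; the display name, template path, pipeline root, and service account are placeholders.

    # Minimal sketch of resubmitting the pipeline; all names and paths are placeholders.
    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="europe-west4")

    job = aiplatform.PipelineJob(
        display_name="custom-training-pipeline",
        template_path="gs://my-bucket/pipelines/training_pipeline.json",  # compiled pipeline spec
        pipeline_root="gs://my-bucket/pipeline-root",
        enable_caching=False,  # make sure the training component actually re-executes
    )

    job.submit(service_account="pipeline-runner@my-project.iam.gserviceaccount.com")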

Why does the NiFi flow on one machine occasionally fail to sync with another machine?

I get the ERROR log and resolve it by copying the flow from one machine to the others. However, many suggestions tell me to delete the differing flows, after which the flow should be regenerated automatically. Sometimes this works, but it doesn't now. More importantly, does the log show the reason why the flows differ? How can I avoid this error? TIA
NIFI-APP LOG: (screenshot of the error in nifi-app.log)

Step name already exists error when using dataflow runner

Cross-posting from https://groups.google.com/forum/#!topic/kythe/86kNuSCeorI, since I was directed here by the Beam FAQ for Beam questions.
In short, I can run a job written with the Go SDK successfully using the direct runner, but when I try the Dataflow runner I get the following error in the Google Cloud console:
2019-02-17 (12:03:53) Step with name e19 already exists. Duplicates
are not allowed.
I've attached the plan that was printed to stderr at https://pastebin.com/vpu3U52j, and a grep for e19 at https://pastebin.com/L24L1guT.
I'm not very familiar with Beam yet. Which part is responsible for generating the step names? What are likely causes of a collision?
Thank you!
It was actually a bug; I sent a PR to Beam.
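(For anyone hitting a similar error: the Go SDK derives step names internally from pipeline scopes, but the general naming constraint is easiest to see in the Python SDK, where each applied transform needs a unique label. The snippet below is only that Python-side illustration, not the fix that went into the PR.)

    # Python SDK illustration of unique step names; labels are set with ">>".
    import apache_beam as beam

    with beam.Pipeline() as p:
        lines = p | "Create" >> beam.Create(["a", "b", "c"])
        upper = lines | "ToUpper" >> beam.Map(str.upper)
        # Reusing the label "ToUpper" for the next transform would fail with
        # "A transform with label 'ToUpper' already exists in the pipeline",
        # the Python-side analogue of "Step with name e19 already exists".
        lower = lines | "ToLower" >> beam.Map(str.lower)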

Google Cloud Dataflow jobs failing with error 'Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5...'

SDK: Apache Beam SDK for Go 0.5.0
We are running Apache Beam Go SDK jobs on Google Cloud Dataflow. They had been working fine until recently, when they intermittently stopped working (no changes were made to the code or config). The error that occurs is:
Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5 for /var/opt/google/staged/worker: ..., want ; bad MD5 for /var/opt/google/staged/worker: ..., want ;
(Note: it seems as if a second hash value is missing from the error message.)
As best I can guess, there's something wrong with the worker: it seems to be comparing MD5 hashes of the worker and missing one of the values, though I don't know exactly what it's comparing against.
Does anybody know what could be causing this issue?
The fix for this issue seems to have been rebuilding the worker_harness_container_image with the latest changes. I had tried this, but I didn't have the latest release when I built it locally. After I pulled the latest from the Beam repo, rebuilt the image (as per the notes here: https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md), and reran the job, it worked again.
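To pin a job to a rebuilt image, the image can be passed as a pipeline option. The snippet below is only an illustration using the Python SDK's worker_harness_container_image option (the Go SDK's Dataflow runner exposes a similarly named flag); the project, region, bucket, and image URI are placeholders.

    # Illustration (Python SDK): pointing a Dataflow job at a rebuilt worker image.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=us-central1",
        "--staging_location=gs://my-bucket/staging",
        "--temp_location=gs://my-bucket/temp",
        "--worker_harness_container_image=gcr.io/my-project/beam/go:latest",
    ])

    with beam.Pipeline(options=options) as p:
        p | "Create" >> beam.Create([1, 2, 3])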
I'm seeing the same thing. If I look into the Stackdriver logging I see this:
Handler for GET /v1.27/images/apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515/json returned error: No such image: apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515
However, I can pull the image just fine locally. Any ideas why Dataflow cannot pull it?

Distributed JMeter test fails with java error but test will run from JMeter UI (non-distributed)

My goal is to run a load test using 4 Azure servers as load generators and 1 Azure server to initiate the test and gather results. I had the distributed test running and was getting good data. But today, when I remote-start the test, 3 of the 4 load generators fail, with all of the HTTP transactions erroring. The failed transactions log the following error:
Non HTTP response message: java.lang.ClassNotFoundException: org.apache.commons.logging.impl.Log4jFactory (Caused by java.lang.ClassNotFoundException: org.apache.commons.logging.impl.Log4jFactory)
I confirmed the presence of commons-logging-1.2.jar in the jmeter\lib folder on each machine.
To try to narrow down the issue I set up one Azure server to both initiate the load and run JMeter-server but this fails too. However, if I start the test from the JMeter UI on that same server the test runs OK. I think this rules out a problem in the script or a problem with the Azure machines talking to each other.
I also simplified my test plan down to where it only runs one simple http transaction and this still fails.
I've gone through all the basics: reinstalled JMeter, updated Java to the latest version (1.8.0_111), updated the JAVA_HOME environment variable, and backed out the most recent Microsoft security update on the server. Any advice on how to pick this problem apart would be greatly appreciated.
I'm using JMeter 3.0r1743807 and Java 1.8
The Azure servers are running Windows Server 2008 R2
I did get a resolution to this problem. It turned out to be a conflict between some extraneous code in a jar file and a component of JMeter. It was “spooky” because something influenced the load order of the referenced jar files and JMeter components.
I had included a jar file in my JMeter script using the “Add directory or jar to classpath” function in the Test Plan. This jar file has a piece of code I needed for my test along with many other components, and one of those components, probably a similar logging function, conflicted with a logging function in JMeter. The test had run fine for months but started failing at the maximally inconvenient time. The problem was revealed by creating a very simple JMeter test that would load and run just fine. If I opened the simple test in JMeter and then, without closing JMeter, opened my problem test, my problem test would not fail. If I reversed the order, opening the problem test followed by the simple test, then the simple test would fail too. Given that the problem followed the order in which things loaded, I started looking at the jar files and found my suspect.
When I built the script I had left the jar file alone, thinking that the functions I needed might have dependencies on other pieces within the jar. Now that things were broken I needed to find out whether that was true, and happily it was not. So, to fix the problem, I changed the extension on my jar file to .zip and edited it in 7-Zip. I removed all the code except what I needed. I kept all the folders in the path to my needed code for two reasons: I did not have to update the code that called the functions, and when I tried changing the path the functions did not work.
Next I changed the extension on the file back to jar and changed the reference in JMeter’s “Add directory or jar to classpath” function to point to the revised jar. I haven’t seen the failure since.
Many thanks to the folks who looked at this. I hope the resolution will help someone out.
