Google Cloud Dataflow read from Elasticsearch error

I have an Apache Beam pipeline reading from an AWS Elasticsearch cluster. The pipeline code is as follows:
PCollection<String> output =
    pipeline.apply(
        ElasticsearchIO.read()
            .withConnectionConfiguration(
                ElasticsearchIO.ConnectionConfiguration.create(
                    new String[] {hostName}, // create() expects an array of host addresses
                    options.getElasticSearchIndex(),
                    options.getElasticSearchType())));
output.apply(
    TextIO.write()
        .to("gs://<bucket-name>/test.txt")
        .withSuffix(".txt"));
pipeline.run();
The job deploys without any errors. I set maxNumWorkers to 3 just to test my code initially, but the pipeline stalls and does not process any of the data.
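For reference, this is roughly how the option is set (a minimal sketch; the surrounding setup is assumed):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Minimal sketch: cap Dataflow autoscaling at 3 workers,
// equivalent to passing --maxNumWorkers=3 on the command line.
DataflowPipelineOptions dataflowOptions =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
dataflowOptions.setMaxNumWorkers(3);
Pipeline pipeline = Pipeline.create(dataflowOptions);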
When I look at the Google Cloud Logs, I see the following log entries:
Proposing dynamic split of work unit <project>;<hash> at {"fractionConsumed":0.5}
Rejecting split request because custom reader returned null residual source.
I see these logs generated over and over again. I let the pipeline run for 20-30 minutes, but there seems to be no progress.
I'm not sure how to debug this issue. Any thoughts on how to proceed?
EDIT: Updated pipeline code.

Related

Vertex pipeline model training component stuck running forever because of metadata issue

I'm attempting to run a Vertex pipeline (custom model training) which I was able to run successfully in a different project. As far as I'm aware, all the pieces of infrastructure (service accounts, buckets, etc.) are identical.
The error appears in a gray box in the pipeline UI when I click on the model training component and reads the following:
Retryable error reported. System is retrying.
com.google.cloud.ai.platform.common.errors.AiPlatformException: code=ABORTED, message=Specified Execution `etag`: `1662555654045` does not match server `etag`: `1662555533339`, cause=null System is retrying.
I've looked in the Logs Explorer and found that the error logs are audit logs with the following tags associated with them:
protoPayload.methodName="google.cloud.aiplatform.internal.MetadataService.RefreshLineageSubgraph"
protoPayload.resourceName="projects/724306335858/locations/europe-west4/metadataStores/default"
This leads me to think there's an issue with the Vertex metadata store or the way my pipeline is using it. The audit logs are automatic, though, so I'm not sure.
I've tried purging the metadata store as well as deleting it completely. I've also tried running a different model training pipeline that had previously worked in a different project, but with no luck.
[Screenshot of the pipeline UI]
The retryable error you were getting was a temporary issue, and it has since been resolved.
You can now rerun the pipeline; it is not expected to enter the infinite retry loop anymore.

Elastic Cloud APM not showing logs in Transactions Page

What makes Kibana not show Docker container logs on the APM "Transactions" page under the "Logs" tab?
I verified that the logs are successfully generated with the associated "trace.id" for proper linking.
I have the exact same environment and configs (7.16.2) running locally via docker-compose, and there it works perfectly.
I could not figure out why this feature works locally but does not show up in the Elastic Cloud deployment.
UPDATE with solution:
I just solved the problem. It's related to the Filebeat version: from 7.16.0 onward, the transaction/log linking stops working. I reverted Filebeat to version 7.15.2 and it started working again.
If you are not using Filebeat: we, for example, rolled our own logging implementation that sends logs from a queue in batches using the Bulk API.
We have our own "ElasticLog" class and use attributes to match the logs-* schema for the Log Stream.
In particular, we had to make sure that trace.id in the log matched the trace.id property of the actual trace. Then the logs started to show up here (it can take a few minutes).
Some more info on how to get the IDs:
We use the OpenTelemetry exporter for traces and an ILoggerProvider for logs. They fire off batches independently of each other.
We populate the trace IDs at the time the log class is instantiated, as a default value, so that we are still in the context of the Activity. This also helps set the timestamp to exactly when the log was created.
This LogEntry then gets passed into the ElasticLogger processor and mapped, as described above, to the ElasticLog entry with the attributes Elasticsearch needs.
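As a rough illustration of the idea, here is a minimal Java/OpenTelemetry sketch (our actual implementation is .NET; the class and field names here are hypothetical):

import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import io.opentelemetry.api.trace.Span;

// Hypothetical log-entry builder: capture trace.id from the current span at the
// moment the log is created, so the document links to the APM transaction.
public final class ElasticLogEntry {
    public static Map<String, Object> create(String message) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("@timestamp", Instant.now().toString()); // stamp at creation time
        doc.put("message", message);
        // Must equal the trace.id of the active trace for the Logs tab linking to work.
        doc.put("trace.id", Span.current().getSpanContext().getTraceId());
        return doc;
    }
}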

Step name already exists error when using dataflow runner

Cross-posting from https://groups.google.com/forum/#!topic/kythe/86kNuSCeorI, since the Beam FAQ directed me here for Beam questions.
In short, I can run a job written with the Go SDK successfully using the direct runner, but when I try the Dataflow runner I get the following error in the Google Cloud console:
2019-02-17 (12:03:53) Step with name e19 already exists. Duplicates are not allowed.
I attach the plan that was printed to stderr at https://pastebin.com/vpu3U52j. Grepping for e19: https://pastebin.com/L24L1guT.
I'm not very familiar with Beam yet. Which part is responsible for generating the step names, and what are likely causes of a collision?
Thank you!
It was actually a bug; I sent a PR to Beam.
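As a general note (not specific to that bug, and a hedged sketch only): in the Beam Java SDK at least, each apply() can be given an explicit, unique name so the pipeline does not depend on auto-generated step names. The paths below are hypothetical.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

// Minimal sketch: explicit step names ("ReadInput", "WriteOutput") put
// uniqueness under the author's control.
Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());
PCollection<String> lines =
    pipeline.apply("ReadInput", TextIO.read().from("gs://<bucket-name>/input-*.txt"));
lines.apply("WriteOutput", TextIO.write().to("gs://<bucket-name>/output"));
pipeline.run();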

Google Cloud Dataflow jobs failing with error 'Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5...'

SDK: Apache Beam SDK for Go 0.5.0
We are running Apache Beam Go SDK jobs in Google Cloud Data Flow. They had been working fine until recently when they intermittently stopped working (no changes made to code or config). The error that occurs is:
Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5 for /var/opt/google/staged/worker: ..., want ; bad MD5 for /var/opt/google/staged/worker: ..., want ;
(Note: the error message seems to be missing the second hash value after each "want".)
As best I can guess, something is wrong with the worker binary: it seems to be comparing MD5 hashes of the worker and missing one of the values, though I don't know exactly what it's comparing against.
Does anybody know what could be causing this issue?
The fix to this issue seems to have been to rebuild the worker_harness_container_image with the latest changes. I had tried this, but I didn't have the latest release when I built it locally. After I pulled the latest from the Beam repo, rebuilt the image (per the notes here: https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md), and reran the job, it worked again.
I'm seeing the same thing. If I look in Stackdriver Logging, I see this:
Handler for GET /v1.27/images/apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515/json returned error: No such image: apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515
However, I can pull the image just fine locally. Any ideas why Dataflow cannot pull it?

Step Failure not reported by Composed Task Runner or reflected in Spring Cloud Dataflow Tables

Currently we are using Spring Cloud Dataflow to run a sequence of apps we have created, based on a definition. Each app we have made is a Spring Batch job with individual steps. The issue we are having is that when one of these steps inside an app's batch job fails, the failure is reflected as expected in the step_execution, job_execution, and task_execution tables in the SCDF database. However, we are not able to rerun any failed SCDF job from the top SCDF level, because the row in the step_execution table for SCDF's step for the overall app never propagates to FAILED in the STATUS column; it is always COMPLETED no matter what happens.
Below I have included a picture that illustrates this: test-simple8-test-app is the app we have created, while check-step, sleep-step, and should-error-step are steps inside that app's job. You can see that should-error-step has FAILED for both ExitCode and Status, while the entry for the app itself has COMPLETED for Status and FAILED for ExitCode.
[Screenshot: the relevant table]
We have tried altering what we report in the task_execution table, since we saw that the Composed Task Runner (CTR) looks for certain fields there, but it does not seem to affect the STATUS column in step_execution. If we manually change that value in the database to FAILED, everything proceeds as we would expect and as is normal for Spring Batch: the job resumes from that app and re-executes it.
Is there a good way to resolve this problem, or is it a problem with the way we are approaching it?
Edit: Added Flow Diagram for better clarity
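For what it's worth, one hedged idea (not from the original thread): if the app's job finishes with BatchStatus COMPLETED but ExitStatus FAILED, a Spring Batch JobExecutionListener can force the two to agree, so the STATUS column is FAILED as well. This is a minimal sketch with a hypothetical listener name, not a confirmed fix for the SCDF table in question.

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

public class PropagateFailureJobListener implements JobExecutionListener {

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // no-op
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // If the job exited with a FAILED exit code but its BatchStatus is still
        // COMPLETED, force the status to FAILED so the status columns reflect the
        // failure and a restart can resume from this app.
        boolean exitFailed = ExitStatus.FAILED.getExitCode()
                .equals(jobExecution.getExitStatus().getExitCode());
        if (exitFailed && jobExecution.getStatus() == BatchStatus.COMPLETED) {
            jobExecution.setStatus(BatchStatus.FAILED);
        }
    }
}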
