Flink on GCP No FileSystem for scheme: gs - hadoop

I've been trying to use Flink on GCP (https://github.com/spotify/flink-on-k8s-operator), but there is a problem with Google Cloud Storage access.
So, I've just followed the steps explained here (https://github.com/spotify/flink-on-k8s-operator/blob/master/images/flink/README.md)
and created a Docker image like this:
ARG GCS_CONNECTOR_VERSION=latest-hadoop2
ARG FLINK_HADOOP_VERSION=2.8.3-10.0
ARG GCS_CONNECTOR_NAME=gcs-connector-${GCS_CONNECTOR_VERSION}.jar
ARG GCS_CONNECTOR_URI=https://storage.googleapis.com/hadoop-lib/gcs/${GCS_CONNECTOR_NAME}
ARG FLINK_HADOOP_JAR_NAME=flink-shaded-hadoop-2-uber-${FLINK_HADOOP_VERSION}.jar
ARG FLINK_HADOOP_JAR_URI=https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/${FLINK_HADOOP_VERSION}/${FLINK_HADOOP_JAR_NAME}
RUN echo "Downloading ${GCS_CONNECTOR_URI}" && \
wget -q -O /opt/flink/lib/${GCS_CONNECTOR_NAME} ${GCS_CONNECTOR_URI}
RUN echo "Downloading ${FLINK_HADOOP_JAR_URI}" && \
wget -q -O /opt/flink/lib/${FLINK_HADOOP_JAR_NAME} ${FLINK_HADOOP_JAR_URI}
I can see the jars in the task manager's and job manager's lib folder after deploying the job, but the task manager throws an error like:
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'gs'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded. For a full list of supported file systems, please see https://ci.apache.org/projects/flink/flink-docs-stable/ops/filesystems/.
The interesting thing here is that the task manager throws an error, yet I can see that the base path for the checkpoint is created on GCS successfully. For example:
I set gs://bucket/flink/job/checkpoint as the checkpoint path, and I can see this folder after deploying, but of course there is no data inside.
What can the problem be?

You should check the official Flink GCS connector docs. Basically, you need to copy the optional GCS plugin into the plugins directory to make it available to Flink in your container image.
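A minimal sketch of that step, assuming a Flink base image (1.15 or later) that ships the flink-gs-fs-hadoop plugin jar in its opt/ directory; the exact jar name depends on your Flink version:
# In the Dockerfile: enable the bundled GCS filesystem plugin
RUN mkdir -p /opt/flink/plugins/gs-fs-hadoop && \
    cp /opt/flink/opt/flink-gs-fs-hadoop-*.jar /opt/flink/plugins/gs-fs-hadoop/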
In addition to this, I recommend you check out the recently added Flink Kubernetes Operator project, which should give you some benefits over your current setup and improve integration with newer Flink versions.
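If you go that route, the operator's quickstart installs it with Helm, roughly like this (check the current docs for the exact chart version; <OPERATOR_VERSION> is a placeholder):
# Add the operator's Helm repo and install the chart into the cluster
helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-<OPERATOR_VERSION>/
helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator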

Related

Scaling up a Serverless Web Crawler and Search Engine in aws

https://github.com/aws-samples/aws-step-functions-kendra-web-crawler-search-engine
I was referring to the above link and implementing web crawling on a particular website.
I have deployed the stack using the command deploy --profile <YOUR_AWS_PROFILE> --with-kendra,
but when I am using
crawl --profile <YOUR_AWS_PROFILE> --name lambda-docs --base-url https://docs.aws.amazon.com/ --start-paths /lambda --keywords lambda/latest/dg
it is giving me the error:
'/crawl' is not recognized as an internal or external command,
operable program or batch file.
In the link it says: "When the infrastructure has been deployed, you can trigger a run of the crawler with the included utility script".
Is there something I need to install for the crawl command?
That should be ./crawl according to the README of the project.
Your error message also sounds like it's coming from Windows, but the crawl script is written in Bash, so you may run into issues unless you switch to Linux/MacOS/BSD (or WSL).
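For example, from the repository root in a Bash-capable shell (e.g. WSL on Windows):
# Note the ./ prefix: the script is run from the repo, not installed on PATH
./crawl --profile <YOUR_AWS_PROFILE> --name lambda-docs --base-url https://docs.aws.amazon.com/ --start-paths /lambda --keywords lambda/latest/dg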

Kedro deployment to databricks

Maybe I misunderstand the purpose of packaging, but it doesn't seem to be helpful in creating an artifact for production deployment because it only packages code. It leaves out the conf, data, and other directories that make the Kedro project reproducible.
I understand that I can use the Docker or Airflow plugins for deployment, but what about deploying to Databricks? Do you have any advice here?
I was thinking about making a wheel that could be installed on the cluster but I would need to package the conf first. Another option is to just sync a git workspace to the cluster and run kedro via a notebook.
Any thoughts on a best practice?
If you are not using Docker and are just deploying Kedro directly on a Databricks cluster, this is how we have been deploying Kedro to Databricks:
The CI/CD pipeline builds the project with kedro package, which creates a wheel file.
Upload dist and conf to DBFS or copy them to Azure Blob Storage (if using Azure Databricks); see the sketch below.
This uploads everything to Databricks on every git push.
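A rough sketch of that upload step in the CI/CD pipeline, assuming the Databricks CLI's dbfs command is configured and reusing the /dbfs/project_name/build/cicd/<branch> layout from the init script below ($BRANCH is whatever your pipeline sets):
kedro package   # builds the wheel into dist/
# Copy the build artefacts and config to DBFS (paths are illustrative)
dbfs cp --recursive --overwrite dist dbfs:/project_name/build/cicd/$BRANCH/dist
dbfs cp --recursive --overwrite conf dbfs:/project_name/build/cicd/$BRANCH/conf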
Then you can have an init script notebook in Databricks, something like:
from cargoai import run
from cargoai.pipeline import create_pipeline

# Branch name is passed in from the calling notebook via a widget
branch = dbutils.widgets.get("branch")

# Load the project config that was uploaded by the CI/CD pipeline
conf = run.get_config(
    project_path=f"/dbfs/project_name/build/cicd/{branch}"
)
catalog = run.create_catalog(config=conf)
pipeline = create_pipeline()
Here conf, catalog, and pipeline will be available
Call this init script when you want to run a branch or the master branch in production, like: %run "/Projects/InitialSetup/load_pipeline" $branch="master"
For development and testing, you can run specific nodes: pipeline = pipeline.only_nodes_with_tags(*tags)
Then run a full or a partial pipeline with just SequentialRunner().run(pipeline, catalog)
In production, this notebook can be scheduled by databricks. If you are on Azure Databricks, you can use Azure Data Factory to schedule and run this.
I found the best option was to just use another tool for packaging, deploying, and running the job. Using MLflow with Kedro seems like a good fit. I do almost everything in Kedro but use MLflow for the packaging and job execution: https://medium.com/@QuantumBlack/deploying-and-versioning-data-pipelines-at-scale-942b1d81b5f5. My MLproject file looks something like this:
name: My Project
conda_env: conda.yaml
entry_points:
  main:
    command: "kedro install && kedro run"
Then running it with:
mlflow run -b databricks -c cluster.json . -P env="staging" --experiment-name /test/exp
So there is a section of the documentation that deals with Databricks:
https://kedro.readthedocs.io/en/latest/04_user_guide/12_working_with_databricks.html
The easiest way to get started will probably be to sync with git and run via a Databricks notebook. However, as mentioned, there are other ways using the ".whl" and referencing the "conf" folder.

"No filesystem found for scheme gs" when running dataflow in google cloud platform

I am running my Google Dataflow job in Google Cloud Platform (GCP).
When I run this job locally it works well, but when running it on GCP, I get this error:
"java.lang.IllegalArgumentException: No filesystem found for scheme gs".
I have access to that Google Cloud URI: I can upload my jar file to it, and I can see some temporary files from my local job.
My Job id in GCP:
2019-08-08_21_47_27-162804342585245230 (beam version:2.12.0)
2019-08-09_16_41_15-11728697820819900062 (beam version:2.14.0)
I have tried Beam versions 2.12.0 and 2.14.0; both of them give the same error.
java.lang.IllegalArgumentException: No filesystem found for scheme gs
at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:456)
at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:526)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers.resolveTempLocation(BigQueryHelpers.java:689)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:125)
at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:148)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:284)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:206)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:190)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:169)
at org.apache.beam.runners.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:78)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:412)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:381)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:306)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:135)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:115)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:102)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This may be caused by a couple of issues if you build a "fat jar" that bundles all of your dependencies.
1) You must include the dependency org.apache.beam:google-cloud-platform-core to have the Beam GCS filesystem.
2) Inside your fat jar, you must preserve the META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar file with a line org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystemRegistrar. You can find this file in the jar from step 1. You will probably have many files with the same name in your dependencies, registering different Beam filesystems. You need to configure Maven or Gradle to combine these as part of your build, or they will overwrite each other and not work properly.
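If you build with Maven, the Shade plugin's ServicesResourceTransformer merges these service files; with Gradle, the Shadow plugin's mergeServiceFiles() does the same. A quick sanity check on the built artifact (the jar path here is just a placeholder):
# Should print org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystemRegistrar among the other registrars
unzip -p target/your-fat-jar.jar META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar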
There is also one more reason for this exception.
Make sure you create the pipeline (e.g. Pipeline.create(options)) before you try to access files; pipeline creation is what initializes the registered file systems from your options.
[GOLANG] In my case it was solved by adding the imports below for their side effects (they register the file systems):
import (
    _ "github.com/apache/beam/sdks/go/pkg/beam/io/filesystem/gcs"
    _ "github.com/apache/beam/sdks/go/pkg/beam/io/filesystem/local"
    _ "github.com/apache/beam/sdks/go/pkg/beam/io/filesystem/memfs"
)
It's normal. On your computer, your tests use local files (/... on Linux, C:... on Windows). However, Google Cloud Storage isn't a local file system (strictly speaking, it isn't a file system at all), and thus the "gs://" scheme can't be interpreted.
Try TextIO.read().from(...).
You can use it for both local files and external ones like GCS.
However, I experienced an issue months ago when I was developing on Windows: C: wasn't a known scheme (the same error as yours).
It's possible that this works now (I'm no longer on Windows, so I can't test). Otherwise, you have this workaround pattern: set a variable in your config object and branch on it, like:
If (environment config variable is local)
    p.apply(FileSystems.getFileSystemInternal...);
Else
    p.apply(TextIO.read().from(...));

Codebuild Workflow with environment variables

I have a monolith github project that has multiple different applications that I'd like to integrate with an AWS Codebuild CI/CD workflow. My issue is that if I make a change to one project, I don't want to update the other. Essentially, I want to create a logical fork that deploys differently based on the files changed in a particular commit.
Basically my project repository looks like this:
- API
  - node_modules
  - package.json
  - dist
  - src
- REACTAPP
  - node_modules
  - package.json
  - dist
  - src
- scripts
  - 01_install.sh
  - 02_prebuild.sh
  - 03_build.sh
- .ebextensions
In terms of Deployment, my API project gets deployed to elastic beanstalk and my REACTAPP gets deployed as static files to S3. I've tried a few things but decided that the only viable approach is to manually perform this deploy step within my own 03_build.sh script - because there's no way to build this dynamically within Codebuild's Deploy step (I could be wrong).
Anyway, my issue is that I essentially need to create a decision tree to determine which project gets executed, so that if I make a change to API and push, it doesn't automatically deploy REACTAPP to S3 unnecessarily (and vice versa).
I managed to get this working on localhost by updating environment variables at certain points in the build process and then reading them in separate steps. However, this fails on CodeBuild because of permission issues, i.e. I don't seem to be able to update env variables from within the CI process itself.
Explicitly, my buildconf.yml looks like this:
version: 0.2
env:
  variables:
    VARIABLES: 'here'
    AWS_ACCESS_KEY_ID: 'XXXX'
    AWS_SECRET_ACCESS_KEY: 'XXXX'
    AWS_REGION: 'eu-west-1'
    AWS_BUCKET: 'mybucket'
phases:
  install:
    commands:
      - sh ./scripts/01_install.sh
  pre_build:
    commands:
      - sh ./scripts/02_prebuild.sh
  build:
    commands:
      - sh ./scripts/03_build.sh
I'm running my own shell scripts to perform some logic and I'm trying to pass variables between scripts: install->prebuild->build
To give one example, here's the 01_install.sh where I diff each project version to determine whether it needs to be updated (excuse any minor errors in bash):
#!/bin/bash
# STAGE 1
# _______________________________________
# API PROJECT INSTALL
# Do if API version was changed in prepush (this is just a sample and I'll likely end up storing the version & previous version within the package.json):
# diff exits non-zero when the files differ, so "! diff" means "the version changed"
if ! diff -q ./api/version.json ./api/old_version.json > /dev/null 2>&1
then
  echo "🤖 Installing dependencies in API folder..."
  (cd ./api/ && npm install)
  ## Set a variable to be used by the 02_prebuild.sh script
  TEST_API="true"
  export TEST_API
else
  echo "No change to API"
fi
# ______________________________________
# REACTAPP PROJECT INSTALL
# Do if REACTAPP version number has changed (similar to above):
...
Then in my next stage I read these variables to determine whether I should run tests on the project 02_prebuild.sh:
#!/bin/bash
# STAGE 2
# _________________________________
# API PROJECT PRE-BUILD
# Do if install was initiated
if [[ $TEST_API == "true" ]]; then
echo "πŸ€– Run tests on API project..."
cd ./api/ && npm run tests
echo $TEST_API
BUILD_API="true"
export BUILD_API
else
echo "Don't test API"
fi
# ________________________________
# TODO: Complete for REACTAPP, similar to above
...
In my final script I use the BUILD_API variable to build to the dist folder, then I deploy that to either Elastic Beanstalk (for API) or S3 (for REACTAPP).
When I run this locally it works; however, when I run it on CodeBuild I get a permissions failure, presumably because my bash scripts cannot export env vars. I'm wondering if anyone knows how to update environment variables from within the build process itself, or if anyone has a better approach to achieve my goals (a conditional/variable build process on CodeBuild).
EDIT:
So an approach that I've managed to get working is, instead of using env variables, creating new files with specific names (using fs) and then reading their contents to make logical decisions. I can access these files from each of the bash scripts, so it works pretty elegantly with some automatic cleanup.
I won't edit the original question as it's still an issue and I'd like to know how/if other people solved this. I'm still playing around with how to actually use the eb deploy and s3 CLI commands within the build scripts, as CodeBuild does not seem to come with the EB CLI installed and my .ebextensions file does not seem to be honoured.
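For illustration, a minimal sketch of that file-based hand-off (the flag-file name is made up; CODEBUILD_SRC_DIR is the directory CodeBuild checks the source out into):
# 01_install.sh: record the decision in a file instead of an env variable
echo "true" > "$CODEBUILD_SRC_DIR/.test_api"

# 02_prebuild.sh: read the flag back in the next script
if [ "$(cat "$CODEBUILD_SRC_DIR/.test_api" 2>/dev/null)" = "true" ]; then
  echo "🤖 Run tests on API project..."
  (cd ./api/ && npm run tests)
fi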
Source control repos like Github can be configured to send a post event to an API endpoint when you push to a branch. You can consume this post request in lambda through API Gateway. This event data includes which files were modified with the commit. The lambda function can then process this event to figure out what to deploy. If you’re struggling with deploying to your servers from the codebuild container, you might want to try posting an artifact to s3 with an installable package and then have your server grab it from there.

Jar not found error while trying to deploy SCDF Stream

I registered the sink first as follows:
app register --name mysink --type sink --uri file:///Users/swatikaushik/Downloads/kafkaStreamDemo/target/kafkaStreamDemo-0.0.1-SNAPSHOT.jar
Then I created a stream
stream create --definition “:myKafkaTopic > mysink" --name myStreamName --deploy
I got the error
Command failed org.springframework.cloud.dataflow.rest.client.DataFlowClientException: File
/Users/swatikaushik/Downloads/kafkaStreamDemo/target/kafkaStreamDemo-0.0.1-SNAPSHOT.jar must exist
While the jar exists!!
I've followed the Maven local repository mounting approach, using Docker Compose; hope this helps:
Maven:
mvn clean install
Set up your environment variables (PowerShell syntax shown):
$Env:DATAFLOW_VERSION="2.5.1.RELEASE"
$Env:SKIPPER_VERSION="2.4.1.RELEASE"
$Env:HOST_MOUNT_PATH="C:\Users\yourUserName\.m2"
$Env:DOCKER_MOUNT_PATH="/root/.m2/"
Restart/start the containers:
docker-compose down
docker-compose up
Register your apps:
app register --type sink --name mysink --uri maven://groupId:artifactId:version
Register Doc
File permission is one thing - please double check as advised.
A few other ideas:
1) Run app info sink:mysink. If the JAR is actually available, it should return with a list of Boot/Whitelisted properties of the Application.
2) Run the Jar standalone. Make sure it actually starts via java -jar.....
3) The stream definition appears to include a special character (“:myKafkaTopic > mysink" instead of ":myKafkaTopic > mysink" - notice the “ character); it would fail in the Shell, but it looks like you were able to deploy it. A full stacktrace would help.
We just had the same error as described above.
We had mounted the folder with the jar files into the Skipper container.
The solution was that we had to mount the jars into the Data Flow server's Docker container as well:
Skipper deploys the app, but the Data Flow server is the one that registers it.
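As a quick sanity check, you can confirm the jar is visible from inside both containers (assuming the standard SCDF docker-compose service names dataflow-server and skipper; the mount path below is a placeholder, adjust it to wherever you mounted the jars):
# List the mounted jar folder inside each container
docker exec dataflow-server ls /root/apps/
docker exec skipper ls /root/apps/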
