Migrate Kubeflow Docker image components to a Vertex AI pipeline

I am trying to migrate a custom component created in kubeflow to VertexAI.
In Kubeflow I used to create components as docker container images and then load them into my pipeline as follows:
def my_custom_component_op(gcs_dataset_path: str, some_param: str):
    return kfp.dsl.ContainerOp(
        name='My Custom Component Step',
        image='gcr.io/my-project-23r2/my-custom-component:latest',
        arguments=['--gcs_dataset_path', gcs_dataset_path,
                   '--component_param', some_param],
        file_outputs={
            'output': '/app/output.csv',
        }
    )
I would then use them in the pipeline as follows:
@kfp.dsl.pipeline(
    name='My custom pipeline',
    description='The custom pipeline'
)
def generic_pipeline(project_id, gcs_dataset_path, some_param):
    output_component = my_custom_component_op(
        gcs_dataset_path=gcs_dataset_path,
        some_param=some_param
    )
    output_next_op = next_op(
        gcs_dataset_path=dsl.InputArgumentPath(output_component.outputs['output']),
        next_op_param="some other param"
    )
Can I reuse the same component Docker image from Kubeflow v1 in a Vertex AI pipeline? How can I do that, hopefully without changing anything in the component itself?
I have found examples online of Vertex AI pipelines that use the @component decorator as follows:
@component(base_image=PYTHON37, packages_to_install=[PANDAS])
def my_component_op(
    gcs_dataset_path: str,
    some_param: str,
    dataset: Output[Dataset],
):
    ...perform some op...
But this would require me to copy-paste the code from my Docker image into my pipeline, and that is not really something I want to do. Is there a way to reuse the Docker image and pass the parameters to it? I couldn't find any example of that anywhere.

You need to prepare a component YAML spec and load it with load_component_from_file.
This is well documented on the KFP v2 page of the Kubeflow documentation.
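For illustration, here is a minimal sketch of such a spec, loaded inline with load_component_from_text (load_component_from_file works the same with a .yaml file). The image name comes from the question; the --output_path flag is a hypothetical argument, since a fixed file_outputs path like /app/output.csv does not map directly onto a component spec, where the output path is supplied by the system. The exact schema accepted depends on your KFP SDK version.

from kfp import components

# Sketch only: a component spec that wraps the existing Docker image.
# --output_path is an assumed flag the container would need to accept.
my_custom_component_op = components.load_component_from_text("""
name: My Custom Component Step
inputs:
- {name: gcs_dataset_path, type: String}
- {name: component_param, type: String}
outputs:
- {name: output, type: Dataset}
implementation:
  container:
    image: gcr.io/my-project-23r2/my-custom-component:latest
    args: [
      --gcs_dataset_path, {inputValue: gcs_dataset_path},
      --component_param, {inputValue: component_param},
      --output_path, {outputPath: output},
    ]
""")
# Equivalently: components.load_component_from_file('my_custom_component.yaml')

The loaded op can then be called inside the @dsl.pipeline function just like a decorated component, without copying the container's code into the pipeline.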

Related

Configure subnetwork for vertex ai pipeline component

I have a vertex ai pipeline component that needs to connect to a database. This database exists in a VPC network. Currently my component is failing because it is not able to connect to the database, but I believe I can get it to work if I can configure the component to use the subnetwork.
How do I configure the workerPoolSpecs of the component to use the subnetwork?
I was hoping I could do something like this:
preprocess_data_op = component_store.load_component('org/ml_engine/preprocess')

@dsl.pipeline(name="test-pipeline-vertex-ai")
def pipeline(project_id: str, some_param: str):
    preprocess_data_op(
        project_id=project_id,
        my_param=some_param,
        subnetwork_uri="projects/xxxxxxxxx/global/networks/data",
    ).set_display_name("Preprocess data")
However, the param is not there, and I get:
TypeError: Preprocess() got an unexpected keyword argument 'subnetwork_uri'
How do I define the subnetwork for the component?
From the Google docs, there is no mention of how you can run a specific component on a subnetwork.
However, it is possible to run the entire pipeline on a subnetwork by passing the network as part of the job submit API:
job.submit(service_account=SERVICE_ACCOUNT, network=NETWORK)
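For context, here is a minimal sketch of submitting the compiled pipeline on a VPC network with the google-cloud-aiplatform SDK; the project, network, service account and template path below are placeholders.

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="test-pipeline-vertex-ai",
    template_path="pipeline.json",  # compiled pipeline spec
    parameter_values={"project_id": "my-project", "some_param": "value"},
)

# The network applies to the whole pipeline job, not to a single component.
# Note that the network path uses the project number, not the project ID.
job.submit(
    service_account="my-sa@my-project.iam.gserviceaccount.com",
    network="projects/123456789/global/networks/data",
)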

I am not able to create a feature store in vertexAI using labels

I am passing the label values as below to create a featurestore with labels, but after the featurestore is created I do not see any labels on it. Is this still not supported in Vertex AI?
fs = aiplatform.Featurestore.create(
    featurestore_id=featurestore_id,
    labels=dict(project='retail', env='prod'),
    online_store_fixed_node_count=online_store_fixed_node_count,
    sync=sync
)
As mentioned in this featurestore documentation:
A featurestore is a top-level container for entity types, features,
and feature values.
Given this, the "labels" shown in the GCP console UI are the labels at the Feature level, not the featurestore level.
Once a featurestore is created, you will need to create an entity type and then create a Feature that has the labels parameter, as shown in the sample Python code below.
from google.cloud import aiplatform

test_label = {'key1': 'value1'}

def create_feature_sample(
    project: str,
    location: str,
    feature_id: str,
    value_type: str,
    entity_type_id: str,
    featurestore_id: str,
):
    aiplatform.init(project=project, location=location)
    my_feature = aiplatform.Feature.create(
        feature_id=feature_id,
        value_type=value_type,
        entity_type_name=entity_type_id,
        featurestore_id=featurestore_id,
        labels=test_label,
    )
    my_feature.wait()
    return my_feature

create_feature_sample('your-project', 'us-central1', 'test_feature3', 'STRING', 'test_entity3', 'test_fs3')
Below is a screenshot of the GCP console showing that the labels for the test_feature3 feature have the values defined in the sample Python code above.
You may refer to the feature creation documentation for Python for more details.
On the other hand, you can still view the labels you defined for your featurestore itself using the REST API, as shown in the sample below.
curl -X GET \
  -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
  "https://<your-location>-aiplatform.googleapis.com/v1/projects/<your-project>/locations/<your-location>/featurestores"
The result of the REST API call also shows the values of the labels I defined for my "test_fs3" featurestore.

How to set OpenSearch/Elasticsearch as the destination of a Kinesis Firehose?

I am trying to create Data Stream -> Firehose -> OpenSearch infrastructure using the AWS CDK v2. I was surprised to find that, although OpenSearch is a supported Firehose destination, there is nothing in the CDK to support this use case.
In my CDK Stack I have created an OpenSearch Domain, and am trying to create a Kinesis Firehose DeliveryStream with that domain as the destination. However, the kinesisfirehose-destinations package seems to only have a ready-to-use destination for S3 buckets, so there is no obvious way to do this using only the constructs supplied by the aws-cdk, not even the alpha packages.
I think I should be able to write an OpenSearch destination construct by implementing IDestination. I have tried the following simplistic implementation:
import {Construct} from "constructs"
import * as firehose from "@aws-cdk/aws-kinesisfirehose-alpha"
import {aws_opensearchservice as opensearch} from "aws-cdk-lib"

export class OpenSearchDomainDestination implements firehose.IDestination {
    private readonly dest: opensearch.Domain

    constructor(dest: opensearch.Domain) {
        this.dest = dest
    }

    bind(scope: Construct, options: firehose.DestinationBindOptions): firehose.DestinationConfig {
        return {dependables: [this.dest]}
    }
}
then I can use it like so,
export class MyStack extends Stack {
    ...
    private createFirehose(input: kinesis.Stream, output: opensearch.Domain) {
        const destination = new OpenSearchDomainDestination(output)
        const deliveryStream = new firehose.DeliveryStream(this, "FirehoseDeliveryStream", {
            destinations: [destination],
            sourceStream: input,
        })
        input.grantRead(deliveryStream)
        output.grantWrite(deliveryStream)
    }
}
This will compile and cdk synth runs just fine. However, I get the following error when running cdk deploy:
CREATE_FAILED | AWS::KinesisFirehose::DeliveryStream | ... Resource handler returned message: "Exactly one destination configuration is supported for a Firehose
I'm not sure I understand this message but it seems to imply that it will reject outright everything except the one provided S3 bucket destination.
So, my titular question could be answered by the answer to either of these two questions:
How are you supposed to implement bind in IDestination?
Are there any complete working examples of creating a Firehose to OpenSearch using the non-alpha L1 constructs?
(FYI I have also asked this question on the AWS forum but have not yet received an answer.)
At the moment, destinations other than S3 are not supported by the L2 constructs. This is described at https://docs.aws.amazon.com/cdk/api/v1/docs/aws-kinesisfirehose-readme.html
In such cases I go to the source code to see what can be done. See https://github.com/aws/aws-cdk/blob/master/packages/%40aws-cdk/aws-kinesisfirehose/lib/destination.ts. There is no easy way to inject a destination other than S3, since DestinationConfig does not support it. You can see at https://github.com/aws/aws-cdk/blob/master/packages/%40aws-cdk/aws-kinesisfirehose-destinations/lib/s3-bucket.ts how the config for S3 is crafted, and at https://github.com/aws/aws-cdk/blob/f82d96bfed427f8e49910ac7c77004765b2f5f6c/packages/%40aws-cdk/aws-kinesisfirehose/lib/delivery-stream.ts#L364 how that config is translated into the L1 construct CfnDeliveryStream.
Probably the easiest way at the moment is to drop down to the L1 constructs and define the OpenSearch destination there yourself.
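For reference, here is a rough sketch of that approach with the L1 CfnDeliveryStream from aws-cdk-lib. The property names follow the CloudFormation AmazonopensearchserviceDestinationConfiguration schema; the index name, backup bucket and the (overly broad) IAM grants are assumptions you would want to tighten for a real deployment.

// Rough sketch only: an L1 CfnDeliveryStream pointing at an OpenSearch domain.
import {Construct} from "constructs"
import {
  aws_iam as iam,
  aws_kinesis as kinesis,
  aws_kinesisfirehose as firehose,
  aws_opensearchservice as opensearch,
  aws_s3 as s3,
} from "aws-cdk-lib"

export function createFirehoseToOpenSearch(
  scope: Construct,
  input: kinesis.Stream,
  output: opensearch.Domain,
): firehose.CfnDeliveryStream {
  // Role that Firehose assumes to read the stream and write to the domain.
  const role = new iam.Role(scope, "FirehoseRole", {
    assumedBy: new iam.ServicePrincipal("firehose.amazonaws.com"),
  })
  const backupBucket = new s3.Bucket(scope, "FirehoseBackupBucket")
  input.grantRead(role)
  output.grantReadWrite(role)
  backupBucket.grantReadWrite(role)

  return new firehose.CfnDeliveryStream(scope, "DeliveryStream", {
    deliveryStreamType: "KinesisStreamAsSource",
    kinesisStreamSourceConfiguration: {
      kinesisStreamArn: input.streamArn,
      roleArn: role.roleArn,
    },
    amazonopensearchserviceDestinationConfiguration: {
      domainArn: output.domainArn,
      indexName: "my-index", // placeholder index name
      roleArn: role.roleArn,
      s3Configuration: {
        // Firehose requires an S3 backup configuration for failed documents.
        bucketArn: backupBucket.bucketArn,
        roleArn: role.roleArn,
      },
    },
  })
}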

Calling gdal2tiles.py in AWS LAMBDA function

I'm trying to call gdal2tiles.py in AWS Lambda function using GeoLambda layer.
I can't figure out how to call this script from the Lambda function.
My lambda function looks like this so far:
import json
import os
from osgeo import gdal

def lambda_handler(event, context):
    os.system("gdal2tiles.py -p -z [0-6] test.jpg")
In the log I have this error: sh: gdal2tiles.py: command not found
Any idea how to solve this? Thank you.
One way to do it is to import the gdal2tiles utilities from the GeoLambda layer that you added to your Lambda function.
For example:
gdal2tiles.generate_tiles('/path/to/input_file', '/path/to/output_dir/', nb_processes=2, zoom='0-6')
Read more about it in the gdal2tiles package documentation.
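For instance, here is a small sketch of the handler calling the module directly instead of shelling out; the /tmp paths are assumptions (the only writable location in the Lambda filesystem), and the input image is assumed to have been placed there already, e.g. downloaded from S3.

import gdal2tiles

def lambda_handler(event, context):
    # Build a tile pyramid for zoom levels 0-6 from an image already in /tmp.
    gdal2tiles.generate_tiles(
        '/tmp/test.jpg',   # input raster
        '/tmp/tiles/',     # output directory for the generated tiles
        zoom='0-6',
        nb_processes=2,
    )
    return {'statusCode': 200}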
Edit:
OK, I made it work with this set of layers attached to the Lambda function.
The first two layers come straight from GitHub:
arn:aws:lambda:us-east-1:552188055668:layer:geolambda-python:3
arn:aws:lambda:us-east-1:552188055668:layer:geolambda:4
The third layer is our gdal2tiles layer, which is created locally and attached to the Lambda function:
arn:aws:lambda:us-east-1:246990787935:layer:gdaltiles:1
You can download the zip from here.
And I hope you added the environment variables below to your Lambda function configuration:
GDAL_DATA=/opt/share/gdal
PROJ_LIB=/opt/share/proj (only needed for GeoLambda 2.0.0+)

How to use computed properties in github actions

I am trying to rebuild my CI/CD in the new GitHub Actions YAML format; the issue is that I can't seem to use computed values as arguments in a step.
I have tried the following:
- name: Download Cache
  uses: ./.github/actions/cache
  with:
    entrypoint: restore_cache
    args: --bucket=gs://[bucket secret] --key=node-modules-cache-$(checksum package.json)-node-12.7.0
However "$(checksum package.json)" is not valid as part of an argument.
Please note this has nothing to do with whether the command checksum exists; it does exist within the container.
I'm trying to copy this kind of setup that I have in Google Cloud Build:
- name: gcr.io/$PROJECT_ID/restore_cache
  id: restore_cache_node
  args:
    - '--bucket=gs://${_CACHE_BUCKET}'
    - '--key=node-modules-cache-$(checksum package.json)-node-${_NODE_VERSION}'
I expected to be able to use computed arguments in a similar way to other CI/CD solutions.
Is there a way to do this that I am missing? Maybe being able to use run: within a Docker container to run some commands?
The only solution I'm aware of at the moment is to compute the value in a previous step so that you can use it in later steps.
See this answer for a method using set-output. This is the method I would recommend for passing computed values between workflow steps.
Github Actions, how to share a calculated value between job steps?
Alternatively, you can create environment variables. Computed environment variables can also be used in later steps.
How do I set an env var with a bash expression in GitHub Actions?
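For example, here is a sketch of that pattern using the $GITHUB_OUTPUT file (older answers use the now-deprecated ::set-output command). It assumes checksum is available on the runner for the run: step, and CACHE_BUCKET is a placeholder secret name.

# Sketch only: compute the cache key in one step, reuse it in the next.
- name: Compute cache key
  id: cache-key
  run: echo "key=node-modules-cache-$(checksum package.json)-node-12.7.0" >> "$GITHUB_OUTPUT"

- name: Download Cache
  uses: ./.github/actions/cache
  with:
    entrypoint: restore_cache
    args: --bucket=gs://${{ secrets.CACHE_BUCKET }} --key=${{ steps.cache-key.outputs.key }}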
