How to set OpenSearch/Elasticsearch as the destination of a Kinesis Firehose?

I am trying to create Data Stream -> Firehose -> OpenSearch infrastructure using the AWS CDK v2. I was surprised to find that, although OpenSearch is a supported Firehose destination, there is nothing in the CDK to support this use case.
In my CDK Stack I have created an OpenSearch Domain, and am trying to create a Kinesis Firehose DeliveryStream with that domain as the destination. However, the kinesisfirehose-destinations package seems to offer a ready-to-use destination only for S3 buckets, so there is no obvious way to do this easily using only the constructs supplied by the aws-cdk, not even the alpha packages.
I think I should be able to write an OpenSearch destination construct by implementing IDestination. I have tried the following simplistic implementation:
import {Construct} from "constructs"
import * as firehose from "@aws-cdk/aws-kinesisfirehose-alpha"
import {aws_opensearchservice as opensearch} from "aws-cdk-lib"

export class OpenSearchDomainDestination implements firehose.IDestination {
    private readonly dest: opensearch.Domain

    constructor(dest: opensearch.Domain) {
        this.dest = dest
    }

    bind(scope: Construct, options: firehose.DestinationBindOptions): firehose.DestinationConfig {
        return {dependables: [this.dest]}
    }
}
Then I can use it like so:
export class MyStack extends Stack {
    ...

    private createFirehose(input: kinesis.Stream, output: opensearch.Domain) {
        const destination = new OpenSearchDomainDestination(output)
        const deliveryStream = new firehose.DeliveryStream(this, "FirehoseDeliveryStream", {
            destinations: [destination],
            sourceStream: input,
        })
        input.grantRead(deliveryStream)
        output.grantWrite(deliveryStream)
    }
}
This will compile and cdk synth runs just fine. However, I get the following error when running cdk deploy:
CREATE_FAILED | AWS::KinesisFirehose::DeliveryStream | ... Resource handler returned message: "Exactly one destination configuration is supported for a Firehose
I'm not sure I understand this message, but it seems to imply that Firehose will reject outright everything except the one provided S3 bucket destination.
So, my titular question could be answered by the answer to either of these two questions:
How are you supposed to implement bind in IDestination?
Are there any complete working examples of creating a Firehose to OpenSearch using the non-alpha L1 constructs?
(FYI I have also asked this question on the AWS forum but have not yet received an answer.)

At the moment, destinations other than S3 are not supported by the L2 constructs. This is described at https://docs.aws.amazon.com/cdk/api/v1/docs/aws-kinesisfirehose-readme.html
In such cases, I go to the source code to see what can be done. See https://github.com/aws/aws-cdk/blob/master/packages/%40aws-cdk/aws-kinesisfirehose/lib/destination.ts . There is no easy way to inject a destination other than S3, since DestinationConfig does not support it. You can see at https://github.com/aws/aws-cdk/blob/master/packages/%40aws-cdk/aws-kinesisfirehose-destinations/lib/s3-bucket.ts how the config for S3 is crafted, and at https://github.com/aws/aws-cdk/blob/f82d96bfed427f8e49910ac7c77004765b2f5f6c/packages/%40aws-cdk/aws-kinesisfirehose/lib/delivery-stream.ts#L364 how that config is translated into the L1 construct CfnDeliveryStream.
Probably the easiest way at the moment is to drop down to the L1 constructs and define the OpenSearch destination yourself (see the sketch below).
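For illustration, here is a minimal, untested sketch of that L1 approach in CDK v2. The property names follow the AWS::KinesisFirehose::DeliveryStream CloudFormation schema; the index name, the role wiring and the backup bucket are assumptions to adapt, and in practice the delivery role may need broader domain permissions than shown:

import {Construct} from "constructs"
import {
    aws_iam as iam,
    aws_kinesis as kinesis,
    aws_kinesisfirehose as firehose,
    aws_opensearchservice as opensearch,
    aws_s3 as s3,
} from "aws-cdk-lib"

// Sketch only: wires a Kinesis stream to an OpenSearch domain via the CfnDeliveryStream L1 construct.
export function createOpenSearchDeliveryStream(scope: Construct, input: kinesis.Stream, domain: opensearch.Domain): firehose.CfnDeliveryStream {
    // Role Firehose assumes to read from the source stream
    const sourceRole = new iam.Role(scope, "FirehoseSourceRole", {
        assumedBy: new iam.ServicePrincipal("firehose.amazonaws.com"),
    })
    input.grantRead(sourceRole)

    // Role Firehose assumes to write to the domain and to the backup bucket
    const deliveryRole = new iam.Role(scope, "FirehoseDeliveryRole", {
        assumedBy: new iam.ServicePrincipal("firehose.amazonaws.com"),
    })
    domain.grantReadWrite(deliveryRole)
    const backupBucket = new s3.Bucket(scope, "FirehoseBackupBucket")
    backupBucket.grantReadWrite(deliveryRole)

    return new firehose.CfnDeliveryStream(scope, "DeliveryStream", {
        deliveryStreamType: "KinesisStreamAsSource",
        kinesisStreamSourceConfiguration: {
            kinesisStreamArn: input.streamArn,
            roleArn: sourceRole.roleArn,
        },
        amazonopensearchserviceDestinationConfiguration: {
            domainArn: domain.domainArn,
            indexName: "my-index", // hypothetical index name
            roleArn: deliveryRole.roleArn,
            s3BackupMode: "FailedDocumentsOnly",
            s3Configuration: {
                bucketArn: backupBucket.bucketArn,
                roleArn: deliveryRole.roleArn,
            },
        },
    })
}

This roughly mirrors what the L2 S3 destination produces internally; if an OpenSearch destination is later added to the kinesisfirehose-destinations module, the L2 DeliveryStream approach from the question should work directly.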

Related

AWS-CDK: Cross account Resource Access and Resource reference

I have a secret key-value pair in Secrets Manager in Account-1 in us-east-1. This secret is encrypted using a Customer managed KMS key - let's call it KMS-Account-1. All this has been created via console.
Now we turn to CDK. We have cdk.pipelines.CodePipeline which deploys Lambda to multiple stages/environments - so 1st to { Account-2, us-east-1 } then to { Account-3, eu-west-1 } and so on. This has been done.
The lambda code in all stages/environments above, now needs to be changed to use the secret key-value pair present with Account-1's us-east-1 SecretsManager by getting it via secretsmanager client. That code should probably look like this (python):
import boto3
import json

client = boto3.session.Session().client(
    service_name = 'secretsmanager',
    region_name = 'us-east-1'
)
resp = client.get_secret_value(
    SecretId = 'arn:aws:secretsmanager:us-east-1:<ACCOUNT-1>:secret:name/of/the/secret'
)
secret = json.loads(resp['SecretString'])
All lambdas in various accounts and regions (ie. environments) will have the exact same code as above since the secret needs to be fetched from Account-1 in us-east-1.
Firstly I hope this is conceptually possible. Is that right?
Next, how do I change the CDK code to facilitate this? How will the deployment in the code pipeline get permission to import this custom KMS key and Secrets Manager secret, and apply the correct permissions for cross-account access by the lambdas that the CDK pipeline creates?
Can someone please give some pointers?
This is a bit tricky because CloudFormation, and hence CDK, doesn't allow cross-account/cross-stage references, since CloudFormation exports don't work across accounts as far as my understanding goes. All these patterns of "centralised" resources fall into that category - i.e. a resource in one account (or a stage in CDK) referenced by other stages.
If the resource is created outside the context of CDK (like via the console), then you might as well hardcode the names/ARNs/etc. throughout the CDK code where it's used, and that should be sufficient.
For resources that can hold resource-based policies, it's simpler, as you can just attach the cross-account access permissions to them directly - again, offline via the console, since you are maintaining them manually anyway. Each time you add a stage (account) to your pipeline, you will need to go to the resource and add cross-account permissions manually.
For resources that don't have resource-based policies, like SSM for example, things are a bit more roundabout, as you will need to create a Role that can be assumed cross-account and then access the resource through it. In that case you will have to separately maintain the IAM Role too and manually update its trust policy for other accounts as you add stages to your CDK pipeline. Then, as usual, hardcode the role ARN in your CDK code, assume it in some CustomResource lambda and use it.
It gets more interesting if the creation is also done in the CDK code itself (ie. managed by CloudFormation - not done separately via console/aws-cli etc.). In this case, many times you wouldn't "know" the exact ARNs as the physical-id would be generated by CloudFormation and likely be a part of the ARN. Even influencing the physical-id yourself (like by hardcoding the bucket name) might not solve it in all cases. Eg. KMS ARNs and SecretManager ARNs append unique-ids or some sort of hashes to the end of the ARN.
Instead of trying to work all that out, it is best to leave that untouched and let CFn generate whatever random name/ARN it chooses. To then reference these constructs/ARNs, just put them into SSM Parameters in the source/central account. SSM doesn't have a resource-based policy that I know of, so additionally create a role in CDK that trusts the accounts in your CDK code. Once done, there is no more maintenance - each time you add new environments/accounts to CDK (assuming it's a CDK pipeline here), the "loop" construct that you will create will automatically add the new account into the trust relationship.
Now all you need to do is distribute this role ARN and the SSM Parameter names to the other stages. Choose an explicit role name and explicit SSM Parameter names; the manual ARN construction given a role name is pretty straightforward. So distribute those around your CDK code to the other stages (compile-time strings instead of references). In the target stages, create custom resource(s) (AwsCustomResource) backed by an AwsSdkCall lambda to simply assume this role ARN and make the SDK call to retrieve the SSM Parameter values. These values can be anything, like your KMS ARNs, Secrets Manager full ARNs etc., which you couldn't easily guess. Now simply use these.
Roundabout way to do a simple thing, but so far that is all I could do to get this to work.
# You need to maintain this list no matter what you do - so it's nothing extra
all_other_accounts = [ <list of accounts that this cdk deploys to> ]
account_principals = [iam.AccountPrincipal(a) for a in all_other_accounts]
role = iam.Role(
    self, 'CrossAccountReadRole',  # construct id is arbitrary
    assumed_by = iam.CompositePrincipal(*account_principals),  # auto-updated as you change the list above
    role_name = some_explicit_name,
    ...
)
role_arn = f'arn:aws:iam::<account-of-this-stack>:role/{some_explicit_name}'

kms0 = kms.Key(...)
kms0.grant_decrypt(role)
# Because KMS also needs an explicit resource policy even if the role policy allows access to it
kms0.add_to_resource_policy(iam.PolicyStatement(principals = [iam.ArnPrincipal(role_arn)], actions = ...))
kms1 = kms.Key(...)
kms1.grant_decrypt(role)
kms1.add_to_resource_policy(... same as above ...)

secrets0 = secretsmanager.Secret(...)  # maybe this is based off kms0
secrets0.grant_read(role)
secrets1 = secretsmanager.Secret(...)  # maybe this is based off kms1
secrets1.grant_read(role)

# You can turn all this into a loop ofc.
ssm0 = ssm.StringParameter(self, '...', parameter_name = 'kms0_arn', string_value = kms0.key_arn, ...)
ssm0.grant_read(role)
ssm1 = ssm.StringParameter(self, '...', parameter_name = 'kms1_arn', string_value = kms1.key_arn, ...)
ssm1.grant_read(role)
ssm2 = ssm.StringParameter(self, '...', parameter_name = 'secrets0_arn', string_value = secrets0.secret_full_arn, ...)
ssm2.grant_read(role)
...

# Now simply pass around the role and ssm parameter names
for env in environments:
    MyApplicationStage(self, <...>, ..., role_arn = role_arn, params = [ 'kms0_arn', 'kms1_arn', ... ], ...)
And then in the target stage(s):
collect = {}
for param in params:
    fn = AwsSdkCall(service = 'ssm', action = 'get_parameter', parameters = { "Name": param }, ...)
    acr = AwsCustomResource(..., on_create = fn, on_update = fn, ...)
    collect[param] = acr.get_response_field('Parameter.Value')
Now do whatever you want with the collected artifacts, including supplying them as environment variables to your main service lambda (which will be resolved at deploy time).
Remember they will all be Tokens, resolved only at deploy time, but that's true of any resource, whether or not it comes via a custom resource, so it shouldn't matter.
That's a generic pattern which should work for any case.
(GitHub link where this question was asked and I had answered it there too)

How to structure container logs in Vertex AI?

I have a model in Vertex AI. From the logs it seems that Vertex AI has ingested the log into the message field within the jsonPayload field, but I would like to structure the jsonPayload field such that every key in message becomes a field within jsonPayload, i.e. flatten/extract message.
The logs in Stackdriver follow a defined LogEntry schema. Cloud Logging uses structured logs, where log entries use the jsonPayload field to add structure to their payload.
For Vertex AI, the parameters are passed inside the message field which we see in the logs. The structure of these logs is predefined. However, if you want to extract the fields that are present inside the message block, you can refer to the workarounds below:
1. Create a sink:
You can export your logs to a Cloud Storage bucket, BigQuery, Pub/Sub, etc.
If you use BigQuery as the sink, you can then use BigQuery's JSON functions to extract the required data.
2. Download the logs and write your custom code:
You can download the log files and then write your own logic to extract data as per your requirements.
You can refer to the client library (Python) and Python's JSON functions to write that logic.
Using the google-cloud-logging client to write structured logs against a Vertex AI endpoint:
(Make sure you have a service account with permissions to write logs into the GCP project, and, for clean logs, make sure you don't stream any other logs into stderr or stdout.)
import json

import google.cloud.logging_v2 as logging_v2
from google.api_core.client_options import ClientOptions
from google.oauth2 import service_account

data_to_write_to_endpoint = {key1: value1, ...}

# JSON key for a service account permitted to write logs into the GCP
# project where your endpoint is
credentials = service_account.Credentials.from_service_account_info(
    json.loads(SERVICE_ACCOUNT_KEY_JSON)
)
client = logging_v2.client.Client(
    credentials=credentials,
    client_options=ClientOptions(api_endpoint="logging.googleapis.com"),
)

# This represents your Vertex AI endpoint
resource = logging_v2.Resource(
    type="aiplatform.googleapis.com/Endpoint",
    labels={"endpoint_id": YOUR_ENDPOINT_ID, "location": ENDPOINT_REGION},
)

logger = client.logger("LOGGER_NAME")
logger.log_struct(
    info=data_to_write_to_endpoint,
    severity="INFO",  # pick whichever severity applies
    resource=resource,
)

How to handle weird API flow with implicit create step in custom terraform provider

Most Terraform providers demand a predefined flow: Create/Read/Update/Delete/Exists.
I am in a weird situation developing a provider against an API where this behavior diverges a bit.
There are two kinds of resources, Host and Scope. A host can have many scopes. Scopes are updated with configurations.
This generally fits well into the Terraform flow; a full CRUDE flow is possible - except in one instance.
When a new Host is made, it automatically has a default scope attached to it. It is always there, cannot be deleted etc.
I can't figure out how to have my provider gracefully handle this. I would want Terraform to treat it like any other resource, but it doesn't have an explicit CREATE/DELETE, only READ/UPDATE/EXISTS - while every other scope attached to the host would have CREATE/DELETE.
Importing is not an option due to density, requiring an import for every host would render the entire thing pointless.
I originally was going to attempt to split Scopes and Configurations into separate resources so one could be fulfilled by the Host (the host providing the scope ID for a configuration, and then other configurations could get their scope IDs from a scope resource).
However, this approach falls apart because the API for both is the same, unless I wanted to add the abstraction of creating an empty scope and then applying a configuration against it, which may not be fully supported. It would essentially be two resources controlling one resource, which could lead to dramatic conflicts.
A paraphrased example of an execution I thought about implementing:
resource "host" "test_integrations" {
name = "test.integrations.domain.com"
account_hash = "${local.integrationAccountHash}"
services = [40]
}
resource "configuration" "test_integrations_root_configuration" {
name = "root"
parent_host = "${host.test_integrations.id}"
account_hash = "${local.integrationAccountHash}"
scope_id = "${host.test_integrations.root_scope_id}"
hostnames = ["test.integrations.domain.com"]
}
resource "scope" "test_integrations_other" {
account_hash = "${local.integrationAccountHash}"
host_hash = "${host.test_integrations.id}"
path = "/non/root/path"
name = "Some Other URI Path"
}
resource "configuration" "test_integrations_other_configuration" {
name = "other"
parent_host = "${host.test_integrations.id}"
account_hash = "${local.integrationAccountHash}"
scope_id = "${host.test_integrations_other.id}"
}
In this example flow, a configuration resource and a scope resource unfortunately point to the same underlying resource, which I am worried would cause conflicts or confusion about who is responsible for what, and it dramatically confuses the create/delete lifecycle.
But I can't figure out how the TF lifecycle would allow for a resource that would only UPDATE/READ/EXISTS if, say, a flag was given (and how state would handle that).
An alternative would be to just have a Configuration resource, but then if it was the root configuration it would need to skip create/delete, as it is inherently tied to the host.
Ideally I'd be able to handle this situation gracefully. I am trying to avoid including the root scope/configuration in the host definition as it would create a split in how they are written and handled.
The documentation for providers implies you can use a resource AS a schema object in a resource, but does not explain how or why. If it works the way I imagine, it may be possible to create a resource that is only used to inject into the host - but I don't know whether that is how it works or, if it is, how to accomplish it.
I believe I tentatively have found a solution after asking some folks on the gopher slack.
Using the AWS provider's Default VPC resource as a reference, I can "clone" the resource into one with a custom Create/Delete lifecycle.
Loose Example:
func defaultResourceConfiguration() *schema.Resource {
	drc := resourceConfiguration()
	drc.Create = resourceDefaultConfigurationCreate
	drc.Delete = resourceDefaultConfigurationDelete
	return drc
}

func resourceDefaultConfigurationCreate(d *schema.ResourceData, m interface{}) error {
	// double check it exists and update the resource instead
	return resourceConfigurationUpdate(d, m)
}

func resourceDefaultConfigurationDelete(d *schema.ResourceData, m interface{}) error {
	log.Printf("[WARN] Cannot destroy Default Scope Configuration. Terraform will remove this resource from the state file, however resources may remain.")
	return nil
}
This should allow me to provide an identical resource that is designed to interact with the already existing one created by its parent host.

Provision custom-named SQS Queue with PCF Service Broker

I'm trying to create a new queue, but when using
cf create-service aws-sqs standard my-q
the name of the queue in AWS is automatically assigned and is just an id composed of random letters and numbers.
This is fine when using the normal Java client. However, we want to use spring-cloud-aws-messaging (@SqsListener annotation), because it offers us deletion policies out of the box, and a way to extend visibility, so that we can implement retries easily.
@SqsListener(value = "my-q", deletionPolicy = SqsMessageDeletionPolicy.ON_SUCCESS)
public void listen(TestItem item, Visibility visibility) {
    log.info("received message: " + item);
    // do business logic
    // if the call fails
    visibility.extend(1000);
    // throw exception
    // if no failure, the message will be dropped
}
The queue name on the annotation is declared statically, so we can't change it dynamically after reading the VCAP_SERVICES environment variable injected by PCF into the application.
The only alternative we can think of is to use reflection to set accessibility on the annotation's value and set it to the name from VCAP_SERVICES, but that's just nasty, and we'd like to avoid it if possible.
Is there any way to change the name of the queue to something specific on creation? This suggests that it's possible, as seen below:
cf create-service aws-sqs standard my-q -c '{ "CreateQueue": { "QueueName": “my-q”, "Attributes": { "MaximumMessageSize": "1024"} } }'
However, this doesn't work. It returns:
Incorrect Usage: Invalid configuration provided for -c flag. Please
provide a valid JSON object or path to a file containing a valid JSON
object.
How do I set the name on creation of the queue? Or is the only way to achieve my end goal to use reflection?
EDIT: As pointed out by Daniel Mikusa, the double quotes were not real double quotes, and that was causing the error. The command is successful now, however it doesn't create the queue with the intended name. I'm now wondering if this name needs to be set on bind-service instead. The command has a -c option too, but I cannot find any documentation on which parameters are available for an aws-sqs service.

spring-integration-aws dynamic file download

I've a requirement to download a file from S3 based on message content. In other words, the file to download is previously unknown; I have to search for and find it at runtime. S3StreamingMessageSource doesn't seem to be a good fit because:
It relies on polling, whereas I need to wait for the message.
I can't find any way to create a S3StreamingMessageSource dynamically in the middle of a flow. gateway(IntegrationFlow) looks interesting but what I need is a gateway(Function<Message<?>, IntegrationFlow>) that doesn't exist.
Another candidate is S3MessageHandler but it has no support for listing files which I need for finding the desired file.
I can implement my own message handler using AWS API directly, just wondering if I'm missing something, because this doesn't seem like an unusual requirement. After all, not every app just sits there and keeps polling S3 for new files.
There is S3RemoteFileTemplate with the list() function, which you can use in the handle(). Then split() the result and call S3MessageHandler for each remote file to download.
Although the latter also has functionality to download the whole remote dir.
For anyone coming across this question, this is what I did. The trick is to:
Set filters later, not at construction time. Note that there is no addFilters or getFilters method, so filters can only be set once, and can't be added later. @artem-bilan, this is inconvenient.
Call S3StreamingMessageSource.receive manually.
.handle(String.class, (fileName, h) -> {
    if (messageSource instanceof S3StreamingMessageSource) {
        S3StreamingMessageSource s3StreamingMessageSource = (S3StreamingMessageSource) messageSource;
        ChainFileListFilter<S3ObjectSummary> chainFileListFilter = new ChainFileListFilter<>();
        chainFileListFilter.addFilters(
                new S3SimplePatternFileListFilter("**/*/*.json.gz"),
                new S3PersistentAcceptOnceFileListFilter(metadataStore, ""),
                new S3FileListFilter(fileName)
        );
        s3StreamingMessageSource.setFilter(chainFileListFilter);
        return s3StreamingMessageSource.receive();
    }
    log.warn("Expected: {} but got: {}.",
            S3StreamingMessageSource.class.getName(), messageSource.getClass().getName());
    return messageSource.receive();
}, spec -> spec
        .requiresReply(false) // in case all messages got filtered out
)
