I have an MSK cluster set up in two public subnets in the eu-west-1 region. I have a topic named 'awsuseravro1' and a Glue Schema Registry named 'msk-registry' with a schema named the same as the topic. I created an MSK connector with a custom plugin that contains the latest kafka-connect-s3 plugin. I also manually copied schema-registry-kafkaconnect-converter version 1.1.5 into this plugin's lib folder so that the AWS Glue Avro converter configs would work. I uploaded this custom zip to S3, created a custom plugin, and launched a connector with the following configs:
connector.class=io.confluent.connect.s3.S3SinkConnector
s3.region=eu-west-1
flush.size=1
schema.compatibility=FULL
tasks.max=2
topics=awsuseravro1
format.class=io.confluent.connect.s3.format.avro.AvroFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
storage.class=io.confluent.connect.s3.storage.S3Storage
s3.bucket.name=hfs3testing
topics.dir=avrotopics1
key.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
value.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
key.converter.region=eu-west-1
value.converter.region=eu-west-1
key.converter.schemaAutoRegistrationEnabled=true
value.converter.schemaAutoRegistrationEnabled=true
key.converter.avroRecordType=GENERIC_RECORD
value.converter.avroRecordType=GENERIC_RECORD
key.converter.schemaName=awsuseravro1
value.converter.schemaName=awsuseravro1
key.converter.registry.name=msk-registry
value.converter.registry.name=msk-registry
My MSK cluster has no authentication and uses plaintext. The connector starts, but I get the following error:
https://pastebin.com/zUAXgsDP
I also gave the connector role all the required S3 policies and full access to the AWS Glue registry. Custom producers and consumers that I wrote on my EC2 instance, which use the Glue registry, work great. When I use the S3 connector with JSON format (no Glue registry) on a plain-text topic, it saves to the bucket normally.
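For reference, a minimal producer that writes Avro through the Glue Schema Registry looks roughly like the sketch below. The broker endpoint and the one-field schema are placeholders, and the property keys are the AWSSchemaRegistryConstants from the aws-glue-schema-registry serializer library, which mirror the converter settings above; treat it as an illustrative sketch rather than the exact code in use.
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import com.amazonaws.services.schemaregistry.serializers.avro.AWSKafkaAvroSerializer;
import com.amazonaws.services.schemaregistry.utils.AWSSchemaRegistryConstants;

Properties props = new Properties();
props.put("bootstrap.servers", "b-1.mycluster.example.com:9092");   // placeholder broker endpoint
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", AWSKafkaAvroSerializer.class.getName());
props.put(AWSSchemaRegistryConstants.AWS_REGION, "eu-west-1");
props.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "msk-registry");
props.put(AWSSchemaRegistryConstants.SCHEMA_NAME, "awsuseravro1");
props.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);
props.put(AWSSchemaRegistryConstants.AVRO_RECORD_TYPE, "GENERIC_RECORD");

// Trivial one-field schema purely for illustration
Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
GenericRecord value = new GenericData.Record(schema);
value.put("name", "test");

try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("awsuseravro1", "key1", value));
    producer.flush();
}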
I want to configure an Oracle SE instance in my Elastic Beanstalk environment. I have chosen oracle-se from the selection box, but the related select options for version and instance class do not get updated.
I chose db.t2.micro (for free-tier usage), but it shows that I have selected db.m1.small. It then keeps prompting the error message when I save that configuration.
For example:
Unable to retrieve RDS configuration options.
Configuration validation exception: Invalid option value:
'db.m1.small' (Namespace: 'aws:rds:dbinstance', OptionName:
'DBInstanceClass'): DBInstanceClass db.m1.small not supported for
oracle-se1 db engine.
Sample image with related error message
I have also searched other Stack Overflow threads, such as "Unable to add an RDS instance to Elastic Beanstalk", which state that AWS has resolved this problem, but it does not work for me.
AWS has requested that the product I'm working on identify the requests it makes to our users' S3 resources on their behalf so they can assess its impact.
To accomplish this, we have to set the User-Agent header for every upload request made against an S3 bucket from an EMR application. I'm wondering how this can be achieved.
Hadoop's documentation mentions the fs.s3a.user.agent.prefix property (core-default.xml). However, the s3a protocol seems to be deprecated on EMR (Work with Storage and File Systems), so I'm not sure whether this property will work.
To give a bit more context on what I need to do: with the AWS Java SDK, it is possible to set the User-Agent header's prefix, for example:
// credentials resolved via the SDK's default provider chain (or however you normally obtain them)
AWSCredentials credentials = new DefaultAWSCredentialsProviderChain().getCredentials();
ClientConfiguration conf = new ClientConfiguration()
        .withUserAgentPrefix("APN/1.0 PARTNER/1.0 PRODUCT/1.0");
AmazonS3Client client = new AmazonS3Client(credentials, conf);
Then every request's User-Agent HTTP header will have a value similar to: APN/1.0 PARTNER/1.0 PRODUCT/1.0, aws-sdk-java/1.11.234 Linux/4.15.0-58-generic Java_HotSpot(TM)_64-Bit_Server_VM/25.201-b09 java/1.8.0_201. I need to achieve something similar when uploading files from an EMR application.
S3A is not deprecated in ASF Hadoop; I would argue it is now ahead of what EMR's own connector does. If you are using EMR you may be able to use it; otherwise you get to work with what they implement.
FWIW, in S3A we're looking at what it would take to dynamically change the header for a specific query, so you can go beyond specific users to specific Hive/Spark queries in shared clusters. It would be fairly complex to do, though, as it needs to be a per-request setting.
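If you do go through the S3A connector (rather than EMRFS), the prefix is just a normal Hadoop configuration property; a minimal sketch, with a placeholder bucket name and file paths:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Prepended to the User-Agent string that S3A / the AWS SDK already send
conf.set("fs.s3a.user.agent.prefix", "APN/1.0 PARTNER/1.0 PRODUCT/1.0");
// Only requests issued through the s3a:// filesystem pick this up
FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("s3a://my-bucket/remote.txt"));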
The solution in my case was to include an awssdk_config_default.json file inside the JAR submitted to the EMR job. This file is used by the AWS SDK to let developers override some of its settings.
I added the JSON file to the JAR submitted to EMR with this content:
{
"userAgentTemplate": "APN/1.0 PARTNER/1.0 PRODUCT/1.0 aws-sdk-{platform}/{version} {os.name}/{os.version} {java.vm.name}/{java.vm.version} java/{java.version}{language.and.region}{additional.languages} vendor/{java.vendor}"
}
Note: passing the fs.s3a.user.agent.prefix property to the EMR job didn't work. AWS EMR uses EMRFS (which is built on the AWS SDK) when handling files stored in S3. I realized this from an exception occasionally thrown in AWS EMR; part of its stack trace was:
Caused by: java.lang.ExceptionInInitializerError: null
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createAndTrack(TemporaryDirectoriesGenerator.java:144)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createTemporaryDirectories(TemporaryDirectoriesGenerator.java:93)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:616)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:825)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:217)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
I'm posting the answer here for future reference. Some interesting links:
The class in AWS SDK that uses this configuration file: InternalConfig.java
https://stackoverflow.com/a/31173739/1070393
EMRFS
I have turned on database activity events, which I think are some kind of log on AWS Aurora. They are currently being passed through AWS Kinesis into S3 via AWS Firehose. The log in S3 looks like this:
{"type":"DatabaseActivityMonitoringRecords","version":"1.0","databaseActivityEvents":"AYADeOC+7S/mFpoYLr17gZCXuq8AXwABABVhd3MtY3J5cHRvLXB1YmxpYy1rZXkAREFvbjhIZ01uQTVpVHlyS0l3NnVIOS9xdXF3OWEza0xZV0c2QXYzQmtWUFI2alpIK2hsczNwalAyTTIzYnpPS2RXUT09AAEAAkJDABtEYXRhS2V5AAAAgAAAAAwzb2YKNe4h6b2CpykAMLzY7gDftUKUr3QxmxSzylw9qCRxnGW9Fn1qL4uKnbDV/PE44WyOQbXKGXv9s8BxEwIAAAAADAAAEAAAAAAAAAAAAAAAAAC+gU55u4hvWxW1RG/FNNSJ/////wAAAAEAAAAAAAAAAAAAAAEAAACtbmBmDwZw2/1rKiwA4Nyl7cm19/RcHhCpMMwbOFFkZHKL/bvsohf5T+yM9vNxCgAi2qTUIEe17VA5bJ0eCcNAA9mb6Ys+PR1w7QhKrQsHHTBC2dhJ4ELwpXamGRmPLga5Dml2rOveA59YefcJ4PhrqztZXfrS8fBYJ3HgBWHY9nPh1jdyinjQAl61hQrz2LPII85zlqAWTNeL2pXwaRdtGdYeIXXoh4VsoV3Q18Hj/uOQzTIbT8EJvwnk0gj8AGcwZQIxAJNuoCJhHPUfbkk0fHF6HYz1STIc4HX2HOl0qSIHqwpgtQK6BMa3YlPI9hNwhB8x+AIwWDY0bMjuLRGQgjjBv5z1xPpZQ+pMZ4K6m9JaNBFVKxZTvqDL1z7lrV0rlbZThad+","key":"AQIDAHhQgnMAiP8TEQ3/r+nxwePP2VOcLmMGvmFXX8om3hCCugE7IUxSH/eJBEKvnkYoNIqFAAAAfjB8BgkqhkiG9w0BBwagbzBtAgEAMGgGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMQIX97gE5ioBR1+nnAgEQgDuDX2B2T7nOxjKDyL31+wHJb0pwkCeaU7CwA6BwIkiT7FmhMB71XgvCVrY9C9ABUtc1e5J7QIfsVB214w=="}
I think a KMS key is being used to encrypt that log. How do I decrypt it? Is there working sample code somewhere? Also, more importantly, the Aurora database I'm using is a test database with no activity (no inserts, selects, or updates). Why are there so many logs? Why are there so many databaseActivityEvents? They seem to be written to S3 every minute of the day.
Yes, it uses the RDS activity stream KMS key (ActivityStreamKmsKeyId) to encrypt the log event, and the result is also base64-encoded. You will have to use the AWS cryptographic SDKs to decrypt the data key and then the log event.
For reference, see their sample Java and Python versions:
Processing a Database Activity Stream using the AWS SDK
In your Firehose pipeline you can add a transformation step with Lambda and do this decryption in your Lambda function.
Why are there so many events in an idle Postgres RDS cluster? They are heartbeat events.
When you decrypt and look at the actual activity event JSON, it has a type field which can be either record or heartbeat. Events with type record are the ones generated by user activity.
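Once you have the decrypted plaintext, filtering the heartbeats out is straightforward. A minimal sketch, assuming Jackson is on the classpath and the payload follows the documented databaseActivityEventList structure:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ActivityEventFilter {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // decryptedJson is the plaintext of one databaseActivityEvents payload,
    // i.e. after the KMS data-key decrypt, Encryption SDK decrypt and gunzip
    // steps shown in the AWS sample code linked above.
    public static void printUserActivity(String decryptedJson) throws Exception {
        JsonNode root = MAPPER.readTree(decryptedJson);
        for (JsonNode event : root.get("databaseActivityEventList")) {
            if ("heartbeat".equals(event.path("type").asText())) {
                continue;   // idle-cluster heartbeat, safe to drop
            }
            // type == "record": actual user/database activity
            System.out.println(event.toString());
        }
    }
}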
I am very new to Pinterest Secor and I am trying to set it up for my local Kafka topic. I tried reading the official GitHub documentation, but it does not explain the complete steps to follow, which fields should be modified in the config files, or what they mean. It would be great if someone could direct me to a useful link.
Here is the link for setting up Secor, as per the official Secor documentation:
Secor doc link
Some major steps are (a minimal properties sketch follows the list):
1. Create a Java class to define your partitions. These partitions are the S3 or Azure paths/folder structures where you would like to upload your data.
2. In the config file https://github.com/pinterest/secor/blob/master/src/main/config/secor.prod.partition.properties:
   - Update the class name to your class name.
   - Set the Kafka group name in the property secor.kafka.group.
   - Set the directory where Secor will keep data temporarily before it is uploaded to S3/Azure Blob: update the property secor.local.path.
   - Update the base path of your S3 location: update the property secor.s3.path.
   - Set the Kafka host from which you want to consume data: kafka.seed.broker.host.
   - Set the ZooKeeper IP and port that manages your Kafka: zookeeper.quorum.
   - Set the time after which a file is uploaded to S3: secor.max.file.age.seconds.
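Putting those together, the relevant part of the properties file ends up looking roughly like this. All values are placeholders for a local setup, and secor.message.parser.class is my guess at the property that holds the custom partitioning class, so check it against your Secor version:
# partitioning class from step 1 (property name assumed)
secor.message.parser.class=com.example.MyMessageParser
secor.kafka.group=secor_backup
secor.local.path=/tmp/secor_data
secor.s3.path=secor_backup
kafka.seed.broker.host=localhost
zookeeper.quorum=localhost:2181
# upload files to S3 after 5 minutes
secor.max.file.age.seconds=300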
Our company is migrating from S3 to GCS. While the command-line utility gsutil works fine, I am having difficulty configuring Hadoop (core-site.xml) to enable access to GCS. This Google page https://storage.googleapis.com/hadoop-conf/gcs-core-default.xml lists the name-value pairs that need to be added, but I don't find any of them in the ~/.boto file. The .boto file only has the following set:
gs_oauth2_refresh_token under [Credentials]
default_project_id under [GSUtil]
A few others, like api_version, etc.
The [OAuth2] section is empty.
Can I somehow generate the necessary keys using gs_oauth2_refresh_token and add them to Hadoop config? Or can I get these from any other gsutil config files?
For Hadoop configuration you'll likely want to use a service account rather than the gsutil credentials that are associated with an actual email address; see the instructions for manual installation of the GCS connector for more details about setting up a P12 keyfile along with the other necessary configuration parameters.
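With a service account and P12 keyfile in place, the core-site.xml entries end up roughly like this (property names as used by the GCS connector's service-account auth; the project ID, email and keyfile path are placeholders):
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>my-gcp-project</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.email</name>
  <value>my-svc-account@my-gcp-project.iam.gserviceaccount.com</value>
</property>
<property>
  <name>google.cloud.auth.service.account.keyfile</name>
  <value>/path/to/my-svc-account.p12</value>
</property>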