Steps to set up Secor for a Kafka topic (Pinterest Secor)

I am very new to Pinterest Secor and I am trying to set it up for my local Kafka topic. I tried reading the official GitHub documentation, but it does not explain the complete steps to follow, which fields in the config files should be modified, or what they mean. It would be great if someone could direct me to a useful link.

Here is the link to set up Secor, as per the official Secor documentation:
Secor doc link
Some major steps are:
1. Create a Java class to define your partitions. These partitions are the S3 or Azure paths/folder structures where you would like to upload your data.
2. In the config file https://github.com/pinterest/secor/blob/master/src/main/config/secor.prod.partition.properties (a sample of the edited properties is sketched below):
   - Update the class name to your class name.
   - Set the Kafka consumer group name in the property secor.kafka.group.
   - Set the directory where Secor keeps data temporarily before it is uploaded to S3/Azure Blob: secor.local.path.
   - Update the base path of your S3 location: secor.s3.path.
   - Set the Kafka broker host from which you want to consume data: kafka.seed.broker.host.
   - Set the ZooKeeper quorum (IP and port) that manages your Kafka cluster: zookeeper.quorum.
   - Set the time after which a file is uploaded to S3: secor.max.file.age.seconds.
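For illustration only, here is a sketch of what the edited properties might look like; all values are placeholders, and the full set of properties should be taken from the template file linked above:

# illustrative values only -- replace with your own
secor.kafka.group=secor_group
secor.local.path=/tmp/secor_data
secor.s3.path=secor_backup
kafka.seed.broker.host=localhost
zookeeper.quorum=localhost:2181
secor.max.file.age.seconds=3600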

Related

KStream: disable local state store

I am using Kafka Streams with Spring Cloud Stream. Our application is stateful, as it does some aggregation. When I run the app, I see the below ERROR message on the console.
I am running this app on a Remote Desktop Windows machine.
Failed to change permissions for the directory C:\Users\andy\project\tmp
Failed to change permissions for the directory C:\Users\andy\project\tmp\my-local-local
But when the same code is deployed on a Linux box, I don't see the error, so I assume it's an access issue.
As per our company policy, we do not have access to change a folder's permissions, and hence chmod 777 did not work either.
My question is: is there a way to disable creating the state store locally and instead use the Kafka changelog topic to maintain the state? I understand this is not ideal, but it is only for my local development. TIA.
You could try to use in-memory state stores instead of the default persistent state stores.
You can do that by providing a state store supplier for in-memory state stores to your stateful operations:
KeyValueBytesStoreSupplier storeSupplier = Stores.inMemoryKeyValueStore("in-mem");
StreamsBuilder builder = new StreamsBuilder();
builder.stream("input-topic")
       .groupByKey()
       .count(Materialized.as(storeSupplier));
From Apache Kafka 3.2 onwards, you can set the store type in the stateful operation without the need for a state store supplier:
StreamsBuilder builder = new StreamsBuilder();
builder.stream("input-topic")
       .groupByKey()
       .count(Materialized.as(Materialized.StoreType.IN_MEMORY));
Or you can set the state store type globally with:
props.put(StreamsConfig.DEFAULT_DSL_STORE_CONFIG, StreamsConfig.IN_MEMORY);
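For context, a minimal sketch of where that property is set (the application id and bootstrap servers are placeholders, and Apache Kafka 3.2+ is assumed):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
// Makes in-memory stores the default for all DSL operations (Kafka 3.2+)
props.put(StreamsConfig.DEFAULT_DSL_STORE_CONFIG, StreamsConfig.IN_MEMORY);
KafkaStreams streams = new KafkaStreams(builder.build(), props);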

How to set User-Agent (prefix) for every upload request to S3 from Amazon EMR application

AWS has requested that the product I'm working on identifies requests that it makes to our users' S3 resources on their behalf so they can assess its impact.
To accomplish this, we have to set the User-Agent header for every upload request done against a S3 bucket from an EMR application. I'm wondering how this can be achieved?
Hadoop's doc mentions the fs.s3a.user.agent.prefix property (core-default.xml). However, the protocol s3a seems to be deprecated (Work with Storage and File Systems), so I'm not sure if this property will work.
To give a bit more context on what I need to do: with the AWS Java SDK, it is possible to set the User-Agent header's prefix, for example:
AWSCredentials credentials;
ClientConfiguration conf = new ClientConfiguration()
.withUserAgentPrefix("APN/1.0 PARTNER/1.0 PRODUCT/1.0");
AmazonS3Client client = new AmazonS3Client(credentials, conf);
Then, every request's User-Agent HTTP header will have a value similar to: APN/1.0 PARTNER/1.0 PRODUCT/1.0, aws-sdk-java/1.11.234 Linux/4.15.0-58-generic Java_HotSpot(TM)_64-Bit_Server_VM/25.201-b09 java/1.8.0_201. I need to achieve something similar when uploading files from an EMR application.
S3A is not deprecated in ASF Hadoop; I would argue that it is now ahead of what EMR's own connector does. If you are using EMR you may be able to use it; otherwise you get to work with what they implement.
FWIW, in S3A we're looking at what it would take to dynamically change the header for a specific query, so you can go beyond specific users to specific Hive/Spark queries in shared clusters. It would be fairly complex to do, though, as it needs to be a per-request setting.
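If your job does go through the S3A connector, the fs.s3a.user.agent.prefix property mentioned in the question is the relevant knob. A minimal sketch of setting it programmatically (the bucket name is a placeholder; the same key can also go in core-site.xml):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
conf.set("fs.s3a.user.agent.prefix", "APN/1.0 PARTNER/1.0 PRODUCT/1.0");
// Any S3A filesystem created from this configuration sends the prefixed User-Agent.
FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);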
The solution in my case was to include an awssdk_config_default.json file inside the JAR submitted to the EMR job. This file is used by the AWS SDK to allow developers to override some default settings.
I added this JSON file to the JAR submitted to EMR with this content:
{
"userAgentTemplate": "APN/1.0 PARTNER/1.0 PRODUCT/1.0 aws-sdk-{platform}/{version} {os.name}/{os.version} {java.vm.name}/{java.vm.version} java/{java.version}{language.and.region}{additional.languages} vendor/{java.vendor}"
}
Note: passing the fs.s3a.user.agent.prefix property to the EMR job didn't work. AWS EMR uses EMRFS when handling files stored in S3, which in turn uses the AWS SDK. I realized this because of an exception occasionally thrown in AWS EMR; part of its stack trace was:
Caused by: java.lang.ExceptionInInitializerError: null
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createAndTrack(TemporaryDirectoriesGenerator.java:144)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createTemporaryDirectories(TemporaryDirectoriesGenerator.java:93)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:616)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:825)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:217)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
I'm posting the answer here for future reference. Some links of interest:
The class in AWS SDK that uses this configuration file: InternalConfig.java
https://stackoverflow.com/a/31173739/1070393
EMRFS

Disabling/pausing database replication using ML-Gradle

I want to disable the Database Replication from the replica cluster in MarkLogic 8 using ML-Gradle. After updating the configurations, I also want to re-enable it.
There are tasks for enabling and disabling flexrep in ML-Gradle, but I couldn't find any such thing for database replication. How can this be done?
ml-gradle uses the Management API to handle configuration changes. Database Replication is controlled by sending a PUT command to /manage/v2/databases/[id-or-name]/properties. Update your ml-config/databases/content-database.json file (example that does not include that property) to include database-replication, including replication-enabled: true.
To see what that object should look like, you can send a GET request to the properties endpoint.
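For illustration only, the added fragment in ml-config/databases/content-database.json might look roughly like this; the database name is a placeholder, and the exact shape of the database-replication object should be confirmed via that GET request:

{
  "database-name": "my-content-database",
  "database-replication": {
    "replication-enabled": true
  }
}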
You can create your own command to set replication-enabled - see https://github.com/rjrudin/ml-gradle/wiki/Writing-your-own-management-task
I'll also add a ticket for making official commands - e.g. mlEnableReplication and mlDisableReplication, with those defaulting to the content database, and allowing for any database to be specified.

Need to Create File inbound channel adapter dynamically

I am new to Spring Integration, so please help me resolve this problem.
The requirement is: we have multiple locations from which we have to read files, and in the future the number can grow; a location can be any file system. So I am trying to use a file inbound channel adapter to meet this requirement.
We have multiple locations stored in our database, along with the polling time and the location we have to poll to get the files.
But if I go with XML configuration, I have to create a new file inbound channel adapter, with all its details, in the XML configuration every time we want to poll a specific location for files, something like below:
<int-file:inbound-channel-adapter id="AdapterOne" prevent-duplicates="false"
        directory="${FileInputLoc}" filter="compositeFilter">
    <int:poller fixed-rate="${poolingTime}"/>
</int-file:inbound-channel-adapter>

<int:service-activator input-channel="AdapterOne" ref="testbean"/>

<bean id="testbean" class="com.SomeActivatorClass"/>
Please suggest how I can achieve this in code, so that based on the database rows it creates different channel adapters that poll different locations at different times.
Thanks,
Ashu
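A minimal sketch of one possible approach, assuming Spring Integration 5.x with the Java DSL and IntegrationFlowContext; the method, bean names, and poll settings below are illustrative, not from the original post:

// assumes imports from org.springframework.integration.dsl,
// org.springframework.integration.dsl.context and org.springframework.integration.file.dsl
@Autowired
private IntegrationFlowContext flowContext;

// Call once per row loaded from the database (directory + polling interval).
public void registerFileAdapter(String id, java.io.File directory, long pollRateMs) {
    IntegrationFlow flow = IntegrationFlows
            .from(Files.inboundAdapter(directory).preventDuplicates(false),
                  e -> e.poller(Pollers.fixedRate(pollRateMs)))
            .handle("testbean", "someMethod")   // the activator bean from the XML above; method name is illustrative
            .get();
    flowContext.registration(flow).id(id).register();
}

Each call registers an independent polling adapter at runtime, so adding a new location becomes a database insert plus a registration call instead of an XML change.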

How to enable Hadoop access to Google Cloud Storage using the .boto file

Our company is migrating from S3 to GCS. While the command-line utility gsutil works fine, I am facing difficulty configuring Hadoop (core-site.xml) to enable access to GCS. This Google page https://storage.googleapis.com/hadoop-conf/gcs-core-default.xml lists the name-value pairs that need to be added, but I don't find any of these in the ~/.boto file. The .boto file only has the following set:
gs_oauth2_refresh_token under [Credentials]
default_project_id under [GSUtil]
A few others like api_version, etc.
The [OAuth2] section is empty.
Can I somehow generate the necessary keys using gs_oauth2_refresh_token and add them to Hadoop config? Or can I get these from any other gsutil config files?
For Hadoop configuration you'll likely want to use a service account rather than gsutil credentials that are associated with an actual email address; see these instructions for manual installation of the GCS connector for more details about setting up a p12 keyfile along with the other necessary configuration parameters.
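For illustration only, a sketch of the service-account entries that typically end up in core-site.xml; the key names assume the classic GCS connector's google.cloud.auth.service.account.* properties, and the project id, email, and keyfile path are placeholders:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>my-gcp-project</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.email</name>
  <value>my-service-account@my-gcp-project.iam.gserviceaccount.com</value>
</property>
<property>
  <name>google.cloud.auth.service.account.keyfile</name>
  <value>/path/to/keyfile.p12</value>
</property>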
