Support for Flink ACL in YARN - hadoop

In a secured Hadoop cluster I am trying to access Flink AM page and logs from YARN and seeing the following error:
User %remoteUser are not authorized to view application %appID
Seems like that the cause is lack of support of ACL in YARN from Flink side.
How the code works
The message comes from hadoop/yarn/server/AppBlock class which uses ApplicationACLsManager class. This class performs checks and refers to app info which was set in RMAppManager:
this.applicationACLsManager.addApplication(applicationId,
submissionContext.getAMContainerSpec().getApplicationACLs()
AMContainerSpec is ContainerLaunchContext class which has a PB implementation, submitted from the framework side.
From Flink, this object is created in AbstractYarnClusterDescriptor class which (and other classes in Flink) doesn't call setApplicationACLs.
Question
Is there a way to bypass this or the right solution is to contribute the support to Flink? What is the state of this feature from the Flink side?

This sounds like a limitation in Flink which we should fix. Please open a JIRA issue. The community would be very happy if you could help implementing it.

Related

How to set User-Agent (prefix) for every upload request to S3 from Amazon EMR application

AWS has requested that the product I'm working on identifies requests that it makes to our users' S3 resources on their behalf so they can assess its impact.
To accomplish this, we have to set the User-Agent header for every upload request done against a S3 bucket from an EMR application. I'm wondering how this can be achieved?
Hadoop's doc mentions the fs.s3a.user.agent.prefix property (core-default.xml). However, the protocol s3a seems to be deprecated (Work with Storage and File Systems), so I'm not sure if this property will work.
To give a bit of more context what I need to do, with AWS Java SDK, it is possible to set the User-Agent header's prefix, for example:
AWSCredentials credentials;
ClientConfiguration conf = new ClientConfiguration()
.withUserAgentPrefix("APN/1.0 PARTNER/1.0 PRODUCT/1.0");
AmazonS3Client client = new AmazonS3Client(credentials, conf);
Then, every request's User-Agent http header will has a value similar to: APN/1.0 PARTNER/1.0 PRODUCT/1.0, aws-sdk-java/1.11.234 Linux/4.15.0-58-generic Java_HotSpot(TM)_64-Bit_Server_VM/25.201-b09 java/1.8.0_201. I need to achieve something similar when uploading files from an EMR application.
S3A is not deprecated in ASF hadoop; I will argue that it is now ahead of what EMR's own connector will do. If you are using EMR you may be able to use it, otherwise you get to work with what they implement.
FWIW in S3A we're looking at what it'd take to actually dynamically change the header for a specific query, so you go beyond specific users to specific hive/spark queries in shared clusters. Be fairly complex to do this though as you need to do it on a per request setting.
The solution in my case was to include a awssdk_config_default.json file inside the JAR submitted to EMR job. This file it used by AWS SDK to allow developers to override some custom settings.
I've added this json file within the JAR submitted to EMR with this content:
{
"userAgentTemplate": "APN/1.0 PARTNER/1.0 PRODUCT/1.0 aws-sdk-{platform}/{version} {os.name}/{os.version} {java.vm.name}/{java.vm.version} java/{java.version}{language.and.region}{additional.languages} vendor/{java.vendor}"
}
Note: passing the fs.s3a.user.agent.prefix property to EMR job didn't work. AWS EMR uses EMRFS when handling files stored in S3 which uses AWS SDK. I realized it because of an exception thrown in AWS EMR that I see sometimes, part of its stack trace was:
Caused by: java.lang.ExceptionInInitializerError: null
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createAndTrack(TemporaryDirectoriesGenerator.java:144)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createTemporaryDirectories(TemporaryDirectoriesGenerator.java:93)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:616)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:825)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:217)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
I'm posting the answer here to future references. Some interests links:
The class in AWS SDK that uses this configuration file: InternalConfig.java
https://stackoverflow.com/a/31173739/1070393
EMRFS

Is the GBMV3 object from the H2O server different from the GBMV3 class in the H2O library?

We are working with H2O version 3.22.0.1. We created a process in java 10 that communicates with the REST API utilizing jersey version 2.27 with gson 2.3.1. The process invokes ImportFiles, followed by ParseSetup and Parse. Everything works well up until that point. Then the process invokes 3/ModelBuilders/gbm/parameters. From examining the log, it appears that the H2O server responds as expected. However, gson throws a JsonSyntaxException caused by the following:
java.lang.IllegalStateException: Expected BEGIN_OBJECT but was BEGIN_ARRAY at line 1 column 4115 path $.parameters
Upon further analysis, it appears that the H2O server is providing a GBMV3 object with an array of ModelParameterSchemaV3 objects, while the GBMV3 class, as defined in the library that our client uses, extends SharedTreeV3, which extends ModelBuilderSchema, which has a single instance of ModelParametersSchemaV3. There is an apparent discrepancy between the way the GBMV3 object provided by the H2O server is composed, and the way the class is defined in the H2O library. One has an array of ModelParameterSchemaV3 objects, while the other has a single instance of ModelParametersSchemaV3. Is that the case? If so, could you please help us understand what we may be doing wrong, and how to correct it?
See the files located at: https://1drv.ms/f/s!AsSlPHvlhJI1hIpB2M5X49J5L-h1qw
Run the H2O server. Import the CSV file in H2O Flow. SetupParse and Parse the data. Run the test procedure. Thank you for your kind assistance.
Thanks for the detailed description. To better understand your problem - would you be able to provide a simplified example of how you are calling H2O-3 using the Java bindings?
You might be hitting a bug so if you are able to give us a reproducer we could expedite a fix for this issue.

Logging for Talend job running within spring-boot

We have talend-jobs triggered within Spring-boot application. Is there any way to configure the output of talend-jobs to the application log files?
One workaround we find is to write logs directly to an external file (filePath passed as context-param). But wanted to find if there is a better way to configure this seamlessly.
Not sure if I understood the question correctly, but I guess your concerns might be on what might have happened to the triggered Jobs.
Logging
With Respect to Logging for Talend, You could configure using Log4j,
https://help.talend.com/reader/5DC~TBhDsBie5JTXyVLW4g/QSGCZJKXo~uhKvZDq1DxUg
Monitoring
Regarding the Status of the Job Executed, you could get the execution details retrieved using REST Call(Talend Metaservlet API).
getTaskExecutionStatus
https://help.talend.com/reader/oYf9gKhmYrkWCiSua4qLeg/SLiAyHyDTjuznLR_F~MiQQ
By Modifying the Existing Talend Job,You could also design a like a feedback loop, ie Trigger a REST Call back to your application. With the details of Execution from Talend Job.

How do you use s3a with spark 2.1.0 on aws us-east-2?

Background
I have been working on getting a flexible setup for myself to use spark on aws with docker swarm mode. The docker image I have been using is configured to use the latest spark, which at the time is 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook.
This is working, and I have been just going through to test out the various connectivity paths that I plan to use. The issue I came across is with the uncertainty around the correct way to interact with s3. I have followed the trail on how to provide the dependencies for spark to connect to data on aws s3 using the s3a protocol, vs s3n protocol.
I finally came across the hadoop aws guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as seen in this question that describes how to fix it by defining the endpoint, which I had already done.
I ended up being too far off the standard configuration by being on us-east-2, making me uncertain if I had a problem with the jar files. To eliminate the region issue, I set things back up on the regular us-east-1 region, and I was able to finally connect with s3a. So I have narrowed down the problem to the region, but thought I was doing everything required to operate on the other region.
Question
What is the correct way to use the configuration variables for hadoop in spark to use us-east-2?
Note: This example uses local execution mode to simplify things.
import os
import pyspark
I can see in the console for the notebook these download after creating the context, and adding these took me from being completely broken, to getting the Bad Request error.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
conf = pyspark.SparkConf('local[1]')
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)
For the aws config, I tried both the below method and by just using the above conf, and doing conf.set(spark.hadoop.fs.<config_string>, <config_value>) pattern equivalent to what I do below, except doing it this was I set the values on conf before creating the spark context.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
One thing to note, is that I also tried an alternative endpoint for us-east-2 of s3-us-east-2.amazonaws.com.
I then read some parquet data off of s3.
df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()
Again, after moving the EC2 instance to us-east-1, and commenting out the endpoint config, the above works for me. To me, it seems like endpoint config isn't being used for some reason.
us-east-2 is a V4 auth S3 instance so, as you attemped, the fs.s3a.endpoint value must be set.
if it's not being picked up then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes, the config, even when its lacking in auth details.
Some tactics
set the value is spark-defaults
using the config you've just created, try to explicitly load the filesystem via a call to Filesystem.get(new URI("s3a://bucket-name/parquet-data-name", myConf) will return the bucket with that config (unless it's already there). I don't know how to make that call in .py though.
set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get command
Adding more more diagnostics on BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I can review it and get it in

How does one run Spring XD in distributed mode?

I'm looking to start Spring XD in distributed mode (more specifically deploying it with BOSH). How does the admin component communicate to the module container?
If it's via TCP/HTTP, surely I'll have to tell the admin component where all the containers are? If it's via Redis, I would've thought that I'll need to tell the containers where the Redis instance is?
Update
I've tried running xd-admin and Redis on one box, and xd-container on another with redis.properties updated to point to the admin box. The container starts without reporting any exceptions.
Running the example stream submission curl -d "time | log" http://{admin IP}:8080/streams/ticktock yields no output to either console, and not output to the logs.
If you are using the xd-container script, then the redis.properties is expected to be under "XD_HOME/config" where XD_HOME points the base directory where you have bin, config, lib & modules of xd.
Communication between the Admin and Container runtime components is via the messaging bus, which by default is Redis.
Make sure the environment variable XD_HOME is set as per the documentation; if it is not you will see a logging message that suggests the properties file has been loaded correctly when it has not:
13/06/24 09:20:35 INFO support.PropertySourcesPlaceholderConfigurer: Loading properties file from URL [file:../config/redis.properties]

Resources