Cannot view the Spark UI using cloudformation stack - amazon-ec2

I want to enable the Spark UI for my Glue jobs. I followed Enabling the Spark UI for Jobs and Launching the Spark History Server, using the default YAML file provided by that guide to launch the stack in CloudFormation. After the stack reached CREATE_COMPLETE, I took the SparkUiPublicUrl from the stack's Outputs, but I couldn't open that URL, which means I couldn't view the Spark UI. The one thing I have tried is modifying the inbound rules of the security group, as shown below, but it still doesn't work. Have I missed something here?
[screenshot: security group inbound rules]

Related

How to configure Application Logging Service for SCP application

I have created the hello world application from the SAP Cloud SDK archetypes and pushed this to the cloud foundry environment, binding it to an application logging service instance. My understanding is that this should already provide me with the ability to analyze all logs in the Kibana dashboard of the cloud platform and previously it also worked this way.
However, this time the Kibana dashboard remains empty, so I am wondering if I missed a step or configuration. Looking at the documentation of the service and the respective tutorial blog, I was not able to identify any additional required steps. In the Logs view on the SCP cockpit I can definitely see the entries, but they are not replicated to the ELK stack in the background.
The problem was not SDK related, but seems to have been an incident on the SCP; it now works correctly without any changes.

Apache NiFi deployment in production

I am new to Apache NiFi. From the documentation, I understand that NiFi is a framework with a drag-and-drop UI that helps to build data pipelines. A NiFi flow can be exported as a template, which is then saved in Git.
Are we supposed to import this template into the production NiFi server? Once it is imported, are we supposed to start all the processors manually via the UI? Please help.
Templates are just example flows to share with people and are not really meant for deployment. Please take a look at NiFi Registry and the concept of versioned flows.
https://nifi.apache.org/registry.html
https://www.youtube.com/watch?v=X_qhRVChjZY&feature=youtu.be
https://bryanbende.com/development/2018/01/19/apache-nifi-how-do-i-deploy-my-flow
A template is an XML representation of your flow structure (processors, process groups, controllers, relationships, etc.). You can upload it to another NiFi server for deployment. You can also start all the processors through the NiFi REST API.
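A minimal sketch of that last step, assuming an unsecured NiFi instance on localhost:8080 and Java 11's built-in HTTP client; the process-group id is a placeholder (here the "root" alias for the root group), and a secured instance would additionally need an access token:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class StartNifiProcessGroup {
        public static void main(String[] args) throws Exception {
            // Placeholder host and process-group id; replace with your own values.
            String nifiApi = "http://localhost:8080/nifi-api";
            String groupId = "root";

            // Scheduling request: set every component in the group to RUNNING.
            String body = "{\"id\":\"" + groupId + "\",\"state\":\"RUNNING\"}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(nifiApi + "/flow/process-groups/" + groupId))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }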

Call MapReduce from Local a web app

I have a web app deployed on my localhost.
I also have a MapReduce job (cleandata.jar) in a Hortonworks Sandbox on my PC.
How can I call my MapReduce .jar from my web app?
I'm trying to do this with JSch and ChannelExec in order to perform a system call to the virtual machine, and this works. Is there a more elegant/easier way to do this?
I didn't use the Hortonworks Sandbox, but the proper way of launching YARN (and MapReduce) applications programmatically is to use the YarnClient Java class. It's quite complicated though, because you need to know some Hadoop internals to do that. First, you need network access to the ResourceManager, NodeManagers, DataNodes and NameNode. Next, you should set Configuration properties according to the hdfs-site.xml and yarn-site.xml files you will probably find in the sandbox (you can copy them and put them on the classpath of your webapp).
You can take a look here: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
Note that if your cluster is secured, the webapp from which you submit the job has to run on a Java runtime with the extended security policy (JCE), and you should authenticate using UserGroupInformation.
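A rough sketch of that approach, assuming the Hadoop YARN client libraries are on the webapp's classpath; the sandbox hostname and ports below are placeholders, so adjust them to whatever your copied yarn-site.xml and hdfs-site.xml actually say. It simply connects a YarnClient to the ResourceManager and lists the known applications as a connectivity check:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class YarnConnectivityCheck {
        public static void main(String[] args) throws Exception {
            // If yarn-site.xml / hdfs-site.xml are on the classpath they are picked up
            // automatically; otherwise set the addresses explicitly (placeholder values).
            Configuration conf = new Configuration();
            conf.set("yarn.resourcemanager.address", "sandbox-hdp.hortonworks.com:8050");
            conf.set("fs.defaultFS", "hdfs://sandbox-hdp.hortonworks.com:8020");

            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            // List the applications known to the ResourceManager as a basic sanity check.
            for (ApplicationReport report : yarnClient.getApplications()) {
                System.out.println(report.getApplicationId() + " " + report.getYarnApplicationState());
            }
            yarnClient.stop();
        }
    }

From there, the WritingYarnApplications guide linked above covers creating and submitting an actual application via yarnClient.createApplication().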

What services can I turn off to trigger a second application attempt?

Hadoop YARN includes a configuration to modify how many times an application can be started: yarn.resourcemanager.am.max-attempts.
I am interested in hitting this limit to observe how the system may fail, and I want to be able to do it without modifying code. To mimic production scenarios, I would like to turn off other Hadoop services to cause a second attempt of the application.
What services can I turn off during the application run to trigger another application attempt?
For simplicity, just shut down the storage services hosting your source or target data, for example the HDFS service, the Hive service, etc.

Capacity scheduler in Amazon Elastic MapReduce

I am totally new to Amazon Elastic MapReduce. I want to use my custom scheduler, which is implemented based on the Hadoop capacity scheduler, to schedule my jobs in Amazon Elastic MapReduce.
According to my current understanding, to achieve this I can define only one stage in the job flow and submit my custom jar file via an SSH connection to the master node. However, I cannot find out how to edit the XML configuration files, like capacity-scheduler.xml, on the master node. Does anyone know how to do that?
Moreover, if I want to add dynamic sizing on top of this, can I tune the number of task nodes in the cluster while a job is running? Or must the size of the cluster stay the same within each stage? Thank you so much.
You should use a bootstrap action to change the Hadoop configuration.
The following AWS doc can be referenced for the Configure Hadoop bootstrap action:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop
This blog article that I bookmarked also has some info.
http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/
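An illustrative sketch, not taken from the linked pages, of launching a cluster with a custom bootstrap-action script via the AWS SDK for Java v1; the S3 script path, instance types, counts, and roles are placeholder assumptions:

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

    public class LaunchClusterWithCustomScheduler {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // Placeholder script: it would copy your capacity-scheduler.xml into the
            // Hadoop configuration directory on each node before Hadoop starts.
            BootstrapActionConfig installScheduler = new BootstrapActionConfig()
                    .withName("Install custom capacity-scheduler.xml")
                    .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                            .withPath("s3://my-bucket/install-capacity-scheduler.sh"));

            RunJobFlowRequest request = new RunJobFlowRequest()
                    .withName("custom-scheduler-cluster")
                    .withBootstrapActions(installScheduler)
                    .withServiceRole("EMR_DefaultRole")        // default EMR roles; adjust if yours differ
                    .withJobFlowRole("EMR_EC2_DefaultRole")
                    .withInstances(new JobFlowInstancesConfig()
                            .withMasterInstanceType("m1.large")    // placeholder instance types
                            .withSlaveInstanceType("m1.large")
                            .withInstanceCount(3)
                            .withKeepJobFlowAliveWhenNoSteps(true));

            System.out.println(emr.runJobFlow(request).getJobFlowId());
        }
    }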
For changing the cluster size dynamically, one option is to use the AWS SDK.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/calling-emr-with-java-sdk.html
Using the following interface you can modify the instance count of the instance group.
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/AmazonElasticMapReduce.html
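A hedged sketch of that call with the AWS SDK for Java v1; the instance group id is a placeholder that you would look up for your own cluster (for example via listInstanceGroups):

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.InstanceGroupModifyConfig;
    import com.amazonaws.services.elasticmapreduce.model.ModifyInstanceGroupsRequest;

    public class ResizeTaskInstanceGroup {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // "ig-XXXXXXXXXXXX" is a placeholder for the task instance group's id.
            ModifyInstanceGroupsRequest resize = new ModifyInstanceGroupsRequest()
                    .withInstanceGroups(new InstanceGroupModifyConfig()
                            .withInstanceGroupId("ig-XXXXXXXXXXXX")
                            .withInstanceCount(10)); // new target number of task nodes

            emr.modifyInstanceGroups(resize);
        }
    }

This can be applied while the cluster is running, which addresses the dynamic-sizing part of the question.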
