Restrict Spark jobs from running in local mode - hadoop

Is there any way to restrict users from executing spark-submit with the deploy mode set to local? If I permit users to execute jobs in local mode, my YARN cluster will become under-utilized.
I have configured YARN as the cluster manager to schedule Spark jobs.
I have checked the Spark configuration but did not find any parameter that allows only a specific deploy mode. Users can override the default deploy mode when submitting Spark jobs to the cluster.

You can incentivize and facilitate the use of YARN by setting the spark.master key to yarn in your conf/spark-defaults.conf file. If the configuration already points to the proper master, users will deploy their jobs on YARN by default.
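For example, a minimal sketch of what that could look like in conf/spark-defaults.conf (the second line is optional and only pins the default deploy mode; adjust to your environment):
spark.master              yarn
spark.submit.deployMode   cluster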
I'm not aware of any way to completely bar your users from using a particular master, especially if it's under their control (as is the case for local). What you can do, if you control the Spark installation, is modify the existing spark-shell/spark-submit launch scripts to detect when a user explicitly asks for local as the master and prevent that from happening. Alternatively, you could have your own script that checks for and rejects any local session and then runs spark-shell/spark-submit normally.
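As a rough sketch of that wrapper idea (the /opt/spark install path and the exact pattern matching are assumptions to adapt to your setup):
#!/usr/bin/env bash
# Hypothetical wrapper placed ahead of the real spark-submit on users' PATH.
# It rejects an explicitly requested local master and otherwise delegates unchanged.
for arg in "$@"; do
  case "$arg" in
    local|local\[*|--master=local*)
      echo "local mode is not allowed here; please use --master yarn" >&2
      exit 1
      ;;
  esac
done
exec /opt/spark/bin/spark-submit "$@"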

Related

Spark Job Submission with AWS Hadoop cluster setup

I have a Hadoop cluster set up in AWS EC2, but my development setup (Spark) is on a local Windows system. When I try to connect to the AWS Hive Thrift server I am able to connect, but I get a connection refused error when trying to submit a job from my local Spark configuration. Please note that on Windows my user name is different from the user name under which the Hadoop ecosystem is running on the AWS server. Can anyone explain how the underlying system works in this setup?
1) When I submit a job from my local Spark to the Hive Thrift server and it involves an MR job, will the AWS Hive setup submit that job to the NameNode under its own identity, or will it carry forward my local Spark identity?
2) In my configuration, do I need to run Spark locally with the same user name as the one used for the Hadoop cluster in AWS?
3) Do I also need to configure SSL to authenticate my local system?
Please note that my local system is not part of the Hadoop cluster and cannot be added to the AWS Hadoop cluster.
Please let me know what the actual setup should be for an environment where my Hadoop cluster is in AWS and Spark is running on my local machine.
To simplify the problem, you are free to compile your code locally, produce an uber/shaded JAR, SCP it to any Spark client node in AWS, then run spark-submit --master yarn --class <classname> <jar-file>.
However, if you want to run Spark against EC2 from your local machine, then you can set a few properties programmatically.
Spark submit YARN mode HADOOP_CONF_DIR contents
Alternatively, as mentioned in that post, the best way would be to get your cluster's XML files from HADOOP_CONF_DIR and copy them into your application's classpath. This is typically src/main/resources for a Java/Scala application.
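A hedged sketch of that copy, assuming passwordless SSH to a cluster node and a Maven/SBT-style layout (the host and remote path are placeholders):
# Pull the Hadoop client configs from the cluster onto the application classpath
scp ec2-user@<cluster-node>:/etc/hadoop/conf/{core-site.xml,hdfs-site.xml,yarn-site.xml} src/main/resources/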
Not sure about Python, R, or the SSL configs.
And yes, you need to add a remote user account for your local Windows username on all the nodes. This is how user impersonation will be handled by Spark executors.
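As a rough sketch of what that means on each node (devuser stands in for your Windows login; the hdfs superuser account is assumed from a typical Hadoop install):
# Create the account and give it a home directory in HDFS
sudo useradd devuser
sudo -u hdfs hdfs dfs -mkdir -p /user/devuser
sudo -u hdfs hdfs dfs -chown devuser /user/devuser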

Spark History UI not working | Ambari | YARN

I have a Hadoop cluster set up using Ambari, with services like HDFS, YARN, and Spark running on the hosts.
When I run the sample Spark Pi in cluster mode with master yarn, the application executes successfully and I can see it in the ResourceManager logs.
But when I click on the history link, it does not show the Spark History UI. How do I enable/view it?
First, check if your Spark History Server is already configured by looking for spark.yarn.historyServer.address in the spark-defaults.conf file.
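For reference, the relevant entries usually look something like this in spark-defaults.conf (the host name, the 18080 port, and the HDFS path are assumptions; adjust to your environment):
spark.yarn.historyServer.address  historyserver.example.com:18080
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///spark-history
spark.history.fs.logDirectory     hdfs:///spark-history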
If not configured, this link should help you configure the server: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_installing_manually_book/content/ch19s04s01.html
If it is already configured, check that the history server host is reachable from all the nodes in the cluster and that the port is open.
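A quick way to check from any node (the host and the default 18080 port are assumptions; use whatever spark.yarn.historyServer.address points to in your setup):
# Does the history server port accept connections?
nc -zv historyserver.example.com 18080
# Does it answer HTTP?
curl -s -o /dev/null -w "%{http_code}\n" http://historyserver.example.com:18080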

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input data could be big, running my application on a single machine is not enough. With one more physical machine available, how should I architect this?
I'm considering using Mesos as the cluster manager, but I'm pretty new to HDFS. Is there any way to do this without HDFS (for sharing file data)?
Spark supports a few cluster managers: YARN, Mesos, and Standalone. You may start with Standalone mode, which means you work with your cluster's own file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that launch a Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) on your cluster use the URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark master UI shows more info about the cluster and its workers.
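-Once the master and workers are up, you can submit a job against the cluster like this (the class and jar names are placeholders; 7077 is the default master port):
./bin/spark-submit --master spark://<HOST>:7077 --class com.example.MyApp myapp.jar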
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)

Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon Cloud to run some hadoop MapReduce jobs but I'm struggling to successfully create a cluster. I have downloaded the ec2 files, have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
Also, some of my jobs will require altering Hadoop's parameter settings (specifically the mapred-site.xml config file). Is it possible to alter this file, and if so, how do I gain access to it? Is Hadoop already installed on the Amazon machines, with this file accessible and alterable?
Thanks
Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier than creating your own cluster manually.
But once the job flow is finished, by default it shuts the cluster down, leaving you with the outputs on S3. If all you need is to do some crunching, this may be the way to go.
In case you need the HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop), you may actually need your own cluster on EC2. In that case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering the Hadoop configuration on the nodes it starts is possible using Elastic MapReduce Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
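As an illustration only (the exact flags depend on the CLI version you use, and the key/value shown is just an example), the Configure Hadoop bootstrap action was typically invoked along these lines with the elastic-mapreduce command-line tool:
elastic-mapreduce --create --alive --num-instances 5 \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.child.java.opts=-Xmx1024m"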
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run the code "hadoop-ec2 launch-cluster name n" in the terminal. The master node is successful, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)" and I'm not entirely sure how to progress.
How exactly are you trying to start it? Which AMIs exactly are you using?

How to use custom pool assignment for FairScheduler in Hadoop?

I am trying to take advantage of multiple pools in the FairScheduler, but all my jobs are submitted by a single agent process and therefore all belong to the same user.
I have set mapred.fairscheduler.poolnameproperty to scheduler.pool.name, and then in each job I set "scheduler.pool.name" to a specific pool from pools.xml that I want to use for that job.
I can see on the job configuration web page that both properties have the values I expect, and the scheduler web page shows all the pools I am trying to use. However, all jobs are still running in the pool %username%, where username is the name of the user that was used to submit all jobs.
I am running hadoop version 0.20.1 from the Cloudera distribution.
Any ideas how to make my jobs run in a pool that does not depend on the name of the user who submitted the job?
Looks like restarting the JobTracker was not sufficient to put the new configuration into effect. After restarting all TaskTrackers and the JobTracker, pool assignment works as expected.
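For reference, a minimal sketch of the setup described above (the pool name analytics and the paths are examples, and the job driver is assumed to use ToolRunner so that -D is honored):
<!-- mapred-site.xml: tell the FairScheduler which job property carries the pool name -->
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>scheduler.pool.name</value>
</property>
# Submit a job into the "analytics" pool defined in pools.xml
hadoop jar myjob.jar com.example.MyJob -D scheduler.pool.name=analytics input/ output/
# After changing the scheduler configuration, restart the JobTracker and all TaskTrackers.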
