How to find the Heron scheduler log files when deployed on Aurora and Mesos? - mesos

I can't find the Heron log files when the topology is deployed on Aurora and Mesos. The aurora-scheduler UI shows as follows:
In the log-files directory, there is no scheduler-related log.
Do you have any ideas about this? Where can I find the Heron scheduler log files?

The Heron Aurora scheduler is executed on your local host when a topology is submitted. It prepares everything, updates the state, and then invokes the aurora client. If you add --verbose to your submit command, you can see a lot of logs there, e.g.
[2018-06-28 09:42:35 -0700] [INFO] org.apache.heron.scheduler.aurora.AuroraLauncher: Launching topology in aurora
...
...
[2018-06-28 09:42:35 -0700] [FINE] org.apache.heron.spi.utils.ShellUtils: Running synced process: ``aurora job create --wait-until RUNNING --bind
No scheduling-event logs are saved once the topology starts running inside the Aurora containers.
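For reference, a minimal sketch of such a submission with verbose logging enabled; the cluster/role/env triple, jar path, and topology class are placeholders based on the example topologies shipped with Heron, so adjust them for your own setup:

    heron submit aurora/ubuntu/devel ~/.heron/examples/heron-api-examples.jar \
      org.apache.heron.examples.api.ExclamationTopology ExclamationTopology --verbose

The scheduler logs shown above are printed on the submitting host itself, not inside the Aurora containers.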

Related

How to launch an EMR cluster in AWS Data Pipeline only after completion of an activity in the pipeline flow

Is it possible to launch an EMR cluster only after completion of one of my activities in the AWS Data Pipeline flow?
Unload some data from Redshift (which might take an hour or more).
Start the EMR cluster.
Execute a Spark job in the EMR cluster.
Execute some other activity.
Terminate the cluster.
So, I want dependencies like: "Start EMR Cluster" should depend on "Unload data from Redshift to S3", and "Terminate Cluster" should depend on "Execute Spark job" and "Execute some other activity".
Can someone help me with this?
-Krish
You can do it by utilizing preconditions.
Here are the docs for more details.
You can also use a copy operation and treat that copy operation as the precondition, i.e. express the dependency as a precondition; once it is satisfied, the EMR cluster is created.
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-preconditions.html
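As a rough illustration of the precondition approach, here is a minimal pipeline-definition sketch. The bucket names, paths, Spark class, and object ids are all placeholders, and it assumes the Redshift unload step writes a marker file to S3 when it finishes:

    {
      "objects": [
        {
          "id": "UnloadDone",
          "name": "UnloadDone",
          "type": "S3KeyExists",
          "s3Key": "s3://my-bucket/redshift-unload/_SUCCESS"
        },
        {
          "id": "MyEmrCluster",
          "name": "MyEmrCluster",
          "type": "EmrCluster",
          "releaseLabel": "emr-5.12.0",
          "terminateAfter": "4 Hours"
        },
        {
          "id": "SparkStep",
          "name": "SparkStep",
          "type": "EmrActivity",
          "runsOn": { "ref": "MyEmrCluster" },
          "precondition": { "ref": "UnloadDone" },
          "step": "command-runner.jar,spark-submit,--class,com.example.MyJob,s3://my-bucket/jars/my-job.jar"
        }
      ]
    }

S3KeyExists is a system-managed precondition checked by the Data Pipeline service itself, so the intent here is that the EmrActivity, and with it the EmrCluster it runs on, does not start until the unload output appears in S3.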

Spark History UI not working | Ambari | YARN

I have a Hadoop cluster set up using Ambari, with services like HDFS, YARN, and Spark running on the hosts.
When I run the sample SparkPi in cluster mode with master yarn, the application executes successfully and I can see it in the ResourceManager logs.
But when I click on the history link, it does not show the Spark History UI. How can I enable/view it?
First, check whether your Spark History Server is already configured by looking for spark.yarn.historyServer.address in the spark-defaults.conf file.
If it is not configured, this link should help you configure the server: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_installing_manually_book/content/ch19s04s01.html
If it is already configured, check that the History Server host is accessible from all the nodes in the cluster, and that the port is open.
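As a rough sketch, the relevant spark-defaults.conf entries usually look something like the following; the history host, port, and HDFS directory are placeholders, and on an Ambari-managed cluster these are normally set through the Spark service configs rather than by editing the file by hand:

    spark.eventLog.enabled            true
    spark.eventLog.dir                hdfs:///spark-history
    spark.history.fs.logDirectory     hdfs:///spark-history
    spark.yarn.historyServer.address  historyserver.example.com:18080

The history link in the ResourceManager UI is built from spark.yarn.historyServer.address, so if that property is missing or points to an unreachable host/port, the link typically will not resolve to the Spark History UI.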

Submitting jobs to Spark EC2 cluster remotely

I've set up an EC2 cluster with Spark. Everything works; all masters/slaves are up and running.
I'm trying to submit a sample job (SparkPi). When I ssh to the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've tried both modes for --deploy-mode:
--deploy-mode=client:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
This results in the following warnings/errors repeating indefinitely:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1
...and failed drivers appear in the Spark Web UI under "Completed Drivers" with "State=ERROR".
I've tried passing limits for cores and memory to the submit script, but it didn't help...
--deploy-mode=cluster:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
The result is:
.... Driver successfully submitted as driver-20150223023734-0007 ...
waiting before polling master for driver state
... polling master for driver state
State of driver-20150223023734-0007 is ERROR
Exception from cluster was: java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
    at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
    at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)
So, I'd appreciate any pointers on what is going wrong and some guidance on how to deploy jobs from a remote client. Thanks.
UPDATE:
For the second issue (cluster mode), the file must be globally visible to each cluster node, so it has to be in an accessible location. This solves the IOException but leads to the same issue as in client mode.
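For example (the HDFS path below is a placeholder; an S3 URL readable by every worker would work the same way), the jar can be pushed somewhere every node can reach and referenced by that URL in the submit command:

    hdfs dfs -mkdir -p /jars
    hdfs dfs -put ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar /jars/
    ./bin/spark-submit \
      --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
      --deploy-mode cluster \
      --class SparkPi \
      hdfs:///jars/ec2test_2.10-0.0.1.jar

In standalone cluster mode the worker that runs the driver downloads the jar itself, which is why a file: path that only exists on the laptop produces the FileNotFoundException above.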
The documentation at:
http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security
lists all the different communication channels used in a Spark cluster. As you can see, there are a bunch where the connection is made from the Executor(s) to the Driver. When you run with --deploy-mode=client, the driver runs on your laptop, so the executors will try to make a connection to your laptop. If the AWS security group that your executors run under blocks outbound traffic to your laptop (which the default security group created by the Spark EC2 scripts doesn't), or you are behind a router/firewall (more likely), they fail to connect and you get the errors you are seeing.
So to resolve it, you have to forward all the necessary ports to your laptop, or reconfigure your firewall to allow connections to those ports. Seeing as a bunch of the ports are chosen at random, this means opening up a wide range of ports, if not all of them. So using --deploy-mode=cluster, or running in client mode from within the cluster, is probably less painful.
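If remote client mode really is required, one partial workaround (a sketch, not part of the original answer) is to pin some of the otherwise random ports via Spark configuration so that only a known set has to be forwarded or opened. spark.driver.host, spark.driver.port, and spark.blockManager.port are documented Spark properties (older 1.x releases list additional *.port properties on the same security page); the host name and port numbers below are placeholders:

    ./bin/spark-submit \
      --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
      --conf spark.driver.host=<public-ip-or-hostname-of-laptop> \
      --conf spark.driver.port=35000 \
      --conf spark.blockManager.port=35001 \
      --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

Even with pinned ports, the executors still need a routable path back to the laptop, which is exactly the constraint described above.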
I advise against submitting spark jobs remotely using the port opening strategy, because it can create security problems and is in my experience, more trouble than it's worth, especially due to having to troubleshoot the communication layer.
Alternatives:
1) Livy - now an Apache project! http://livy.io or http://livy.incubator.apache.org/
2) Spark Job server - https://github.com/spark-jobserver/spark-jobserver

I am not sure whether the application is running on just the master or the whole cluster for Spark on EC2

I am using Spark 1.1.1. I followed the instructions given at https://spark.apache.org/docs/1.1.1/ec2-scripts.html and have a cluster of 1 master node and 1 worker running on EC2.
I have made a jar of the application and rsynced it to the slaves. When I run the application using spark-submit with deploy-mode client, the application works. However, when I do so using deploy-mode cluster, it gives me an error saying it cannot find the jar on the worker. The permissions on the jar are 755 on both the master and the worker.
I am not sure whether the application is using the workers when I run it with deploy-mode=client. I don't think it is, since the worker UI does not show any completed jobs. It does, however, show failed jobs when using deploy-mode=cluster.
Am I doing something wrong? Thank you for your help.
You can check whether executors are assigned to the application on the /executors page of the application UI on port 4040 (e.g. http://localhost:4040/executors/). If you only see <driver>, then you are not using the worker. If you see one line for <driver> and one other line (with ID 0, unless it has restarted), then the worker is also providing an executor to your application. Here you can also see how many tasks it has completed for your application, and other stats.
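As a quick check while the job is running (the host below is a placeholder; 4040 is the default UI port and increments if it is already taken), the page can simply be opened in a browser or fetched from the command line:

    curl http://<driver-host>:4040/executors/

In client mode the driver, and therefore this UI, lives on the machine that ran spark-submit; in cluster mode it lives on whichever worker was chosen to run the driver.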

Running Hadoop/Storm tasks on Apache Marathon

I recently came across Apache Mesos and successfully deployed my Storm topology over Mesos.
I want to try running Storm topologies/Hadoop jobs over Apache Marathon (I had issues running Storm directly on Apache Mesos using the mesos-storm framework).
I couldn't find any tutorial/article listing the steps for launching Hadoop/Spark tasks from Apache Marathon.
It would be great if anyone could provide any help or information on this topic (possibly a JSON job definition for Marathon for launching a Storm/Hadoop job).
Thanks a lot
Thanks for your reply. I went ahead and deployed a Storm-Docker cluster on Apache Mesos with Marathon. For service discovery I used HAProxy. This setup allows services (Nimbus, ZooKeeper, etc.) to talk to each other via ports, so adding multiple instances of a service is not a problem: the cluster will find them using the ports and load-balance requests between all the instances of a service. The following GitHub project has the Marathon recipes and Docker images: https://github.com/obaidsalikeen/storm-marathon
Marathon is intended for long-running services, so you could use it to start your JobTracker or Spark scheduler, but you're better off launching the actual batch jobs, like Hadoop/Spark tasks, on a batch framework such as Chronos (https://github.com/airbnb/chronos). Marathon will restart tasks when they complete/fail, whereas Chronos (a distributed cron with dependencies) lets you set up scheduled jobs and complex workflows.
While a little outdated, the following tutorial gives a good example:
http://mesosphere.com/docs/tutorials/etl-pipelines-with-chronos-and-hadoop/
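To make the split concrete, a minimal Chronos job definition for a scheduled batch task might look like the sketch below; the job name, schedule, owner, and command (including the jar path and class) are all placeholders, and the exact fields are described in the Chronos docs:

    {
      "name": "nightly-spark-job",
      "command": "/opt/spark/bin/spark-submit --class com.example.MyJob /opt/jobs/my-job.jar",
      "schedule": "R/2015-06-01T02:00:00Z/P1D",
      "owner": "ops@example.com",
      "epsilon": "PT30M",
      "async": false
    }

A long-running service such as a Storm Nimbus would instead go into a Marathon app definition with fields like "instances", "cpus", and "mem", since Marathon's job is to keep it running indefinitely.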
