Apache Airflow Distributed Processing - hadoop

I am confused about the architecture of Apache Airflow.
As far as I know, when you execute an HQL or Sqoop statement in Oozie, Oozie directs the request to the data nodes.
I want to achieve the same thing in Apache Airflow: I want to execute a shell script, HQL, or Sqoop command, and I want to be sure that my command is executed in a distributed way by the data nodes.
Airflow has different executor types. What should I do in order to run commands on different data nodes concurrently?

It seems you want to execute your tasks on distributed workers. In that case, consider using CeleryExecutor.
CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend (RabbitMQ, Redis, …) and change your airflow.cfg to point the executor parameter to CeleryExecutor and provide the related Celery settings.
See: https://airflow.apache.org/configuration.html#scaling-out-with-celery
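For example, the relevant airflow.cfg entries look roughly like the following (the broker and result-backend URLs are placeholders, and exact key names can vary slightly between Airflow versions):

    [core]
    executor = CeleryExecutor

    [celery]
    broker_url = redis://my-redis-host:6379/0
    result_backend = db+postgresql://airflow:airflow@my-postgres-host/airflow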

Oozie is tightly coupled to the Hadoop nodes, and all scripts need to be uploaded to HDFS, whereas Airflow with the Celery executor has a better architecture. With the Celery executor, the same script or HQL can be executed concurrently on multiple nodes, or on specific nodes, by using queues: workers can be configured to listen only to specific queues and will perform only the tasks routed to them.
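As an illustration, here is a minimal DAG sketch that routes Hive and Sqoop shell commands to workers listening on specific Celery queues. The DAG id, queue names, commands and dates are placeholders, and the import path shown is Airflow 1.x style:

    # Minimal sketch: route tasks to specific Celery workers via queues.
    # Assumes CeleryExecutor is configured; the DAG id, queue names and
    # commands are illustrative placeholders. The import path is Airflow 1.x
    # style (Airflow 2.x uses airflow.operators.bash).
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG(dag_id="hadoop_commands",
             start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:

        # Picked up only by workers subscribed to the "hive" queue
        run_hql = BashOperator(
            task_id="run_hql",
            bash_command="hive -f /path/to/script.hql",
            queue="hive",
        )

        # Picked up only by workers subscribed to the "sqoop" queue
        run_sqoop = BashOperator(
            task_id="run_sqoop",
            bash_command="sqoop import --connect jdbc:mysql://db-host/mydb --table my_table",
            queue="sqoop",
        )

        run_hql >> run_sqoop

A worker started on a given data node with "airflow worker -q hive" (or "airflow celery worker -q hive" on Airflow 2.x) will then only pick up the tasks routed to that queue.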

Related

Is it possible to use Spot instances for Flink Batch job on EMR?

I have a Flink streaming job running in batch mode (mode=Batch) and I want to optimize costs, so I wonder if anyone has experience using Spot Instances for Flink on EMR.
What should I be cautious about?
What should I take into consideration?
Can Flink schedule a JobManager on a task node?
What happens if one of the instances that held the state of a previous stage's computation results fails? Would both failover regions (the previous and the running one) have to be re-computed?
Currently I am using only On-Demand Instances in the EMR cluster.

How to get resources used for FINISHED hadoop jobs from YARN logs using job names?

I have a Unix shell script which runs multiple Hive scripts. I have given job names to every Hive query inside the Hive scripts.
At the end of the shell script, I want to retrieve the resources used (in terms of memory and containers) for those Hive queries, based on the job names, from the YARN logs/applications whose status is 'FINISHED'.
How do I do this?
Any help would be appreciated.
You can pull this information from the YARN History Server via its REST APIs.
https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html
Scroll through this documentation and you will see examples of how to get cluster-level information on executed jobs, and then how to get information on individual jobs.
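As a rough sketch, the same kind of information can also be pulled from the ResourceManager's cluster applications API with a few lines of Python (the host, port and job-name prefix below are placeholders for your cluster):

    # Rough sketch: list FINISHED YARN applications and print resource usage
    # for the ones whose name matches the job names set in the Hive scripts.
    # The ResourceManager host/port and the name prefix are placeholders.
    import requests

    RM_APPS_URL = "http://resourcemanager-host:8088/ws/v1/cluster/apps"

    resp = requests.get(RM_APPS_URL, params={"states": "FINISHED"})
    resp.raise_for_status()

    apps = (resp.json().get("apps") or {}).get("app", [])
    for app in apps:
        if app["name"].startswith("my_hive_job"):  # job name from the Hive script
            print(app["name"],
                  "memorySeconds:", app.get("memorySeconds"),
                  "vcoreSeconds:", app.get("vcoreSeconds"),
                  "finalStatus:", app.get("finalStatus"))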

How to execute a shell script on all nodes of an EMR cluster?

Is there a proper way to execute a shell script on every node in a running EMR hadoop cluster?
Everything I look for brings up bootstrap actions, but that only applies to when the cluster is starting, not for a running cluster.
My application uses Python, so my current guess is to use boto to list the IPs of each node in the cluster, then loop through the nodes and execute the shell script on each one via SSH.
Is there a better way?
If your cluster is already started, you should use steps.
Steps are executed after the cluster is started, so technically this appears to be what you are looking for.
Be careful: steps are executed only on the master node, so you will have to connect to the rest of your nodes in some other way to modify them.
Steps are scripts as well, but they run only on machines in the Master-Instance group of the cluster. This mechanism allows applications like ZooKeeper to configure the master instances and allows applications like HBase and Apache Drill to configure themselves.
Reference
See this also.
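For the boto approach described in the question, a minimal sketch with boto3 could look like this (the region, cluster id, key path, user and script path are all placeholders, and it assumes the machine running it can SSH to the nodes):

    # Rough sketch: list the instances of a running EMR cluster with boto3
    # and run a local shell script on each of them over SSH. Region, cluster
    # id, key path and script path are placeholders.
    import subprocess

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    instances = emr.list_instances(ClusterId="j-XXXXXXXXXXXXX",
                                   InstanceStates=["RUNNING"])["Instances"]

    for instance in instances:
        ip = instance["PrivateIpAddress"]
        with open("/path/to/script.sh") as script:
            subprocess.run(
                ["ssh", "-i", "/path/to/key.pem",
                 "-o", "StrictHostKeyChecking=no",
                 "hadoop@{}".format(ip),
                 "bash -s"],
                stdin=script,
                check=True,
            )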

What is the best way to minimize the initialization time for Apache Spark jobs on Google Dataproc?

I am trying to use a REST service to trigger Spark jobs via the Dataproc API client. However, each job inside the Dataproc cluster takes 10-15 seconds to initialize the Spark driver and submit the application. I am wondering if there is an effective way to eliminate this initialization time for Spark Java jobs triggered from a JAR file in a GCS bucket. Some solutions I am thinking of are:
Pooling a single instance of JavaSparkContext that can be used for every Spark job
Starting a single long-running job and doing all Spark-based processing inside it
Is there a more effective way? How would I implement the above ways in Google Dataproc?
Instead of writing this logic yourself, you may want to investigate the Spark Job Server (https://github.com/spark-jobserver/spark-jobserver), as it should allow you to reuse Spark contexts.
You can write a driver program for Dataproc which accepts RPCs from your REST server and reuses the SparkContext yourself, and then submit this driver via the Jobs API, but I personally would look at the job server first.
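As a rough illustration of the "single long-running driver" idea (shown in PySpark for brevity even though the question is about Java; the port, path mapping and job logic are placeholders), the driver keeps one SparkSession alive and runs work on demand:

    # Rough sketch of a long-running driver that reuses one SparkSession and
    # accepts work over a trivial HTTP endpoint. Submit it once via the
    # Dataproc Jobs API; later requests skip the driver start-up cost.
    # Port, path handling and the "job" itself are placeholders; a real
    # setup would more likely use spark-jobserver.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("long-running-driver").getOrCreate()


    class JobHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Example "job": count the rows of a Parquet dataset whose GCS
            # path is taken from the request path, e.g. GET /my-bucket/data
            dataset = "gs:/" + self.path
            count = spark.read.parquet(dataset).count()
            body = str(count).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)


    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8099), JobHandler).serve_forever()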

How to schedule Hadoop jobs conditionally?

I am pretty new to Hadoop, and particularly to Hadoop Job Scheduling. Here is what I am trying to do.
I have two flows, each with a Hadoop job. I have the freedom to put these flows either in the same project or in different ones. I don't want the Hadoop jobs to run simultaneously on the cluster, but I also want to make sure that they run alternately.
E.g. flow_1 (with hadoop_job_1) runs and finishes -> flow_2 (with hadoop_job_2) runs and finishes -> flow_1 (with hadoop_job_1) runs and finishes and so on.
And of course, I would also like to handle special conditions gracefully.
E.g. if flow_1 is done but flow_2 is not ready, then flow_1 gets a chance to run again if it is ready; if flow_1 fails, flow_2 still gets its turn, etc.
I would like to know which schedulers I can explore which are capable of doing this.
We are using MapR.
Thanks
This looks like a standard use case for Oozie. Take a look at these tutorials: Executing an Oozie workflow with Pig, Hive & Sqoop actions, and Oozie workflow scheduler for Hadoop.
