What is a stateless node? How are Hadoop nodes stateless?

Does a stateless node just mean that nodes are independent of each other? Can you explain this concept with respect to Hadoop?

The explanation can be as follows: each mapper/reducer has no idea about all the other mappers/reducers (i.e. about their current states, their particular outputs if any, etc.). Such statelessness is not great for certain data processing workloads (e.g. graph data) but allows easy parallelization (a particular map/reduce task can be run on any node, meaning a failed mapper/reducer is not an issue, just start a new one on the same input split/mappers' outputs).
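A minimal sketch of why this kind of statelessness makes recovery easy. This is plain Python, not Hadoop code, and the names `map_task`, `run_with_retry`, and the node names are hypothetical: the map task is a pure function of its input split, so after a failure it can simply be rerun on any other node.

```python
from collections import Counter

def map_task(split):
    """Pure function of its input split: it knows nothing about other tasks."""
    return Counter(split.split())

def run_with_retry(split, nodes, failed_nodes):
    """Try nodes in order; a failed node is skipped and the task is rerun elsewhere."""
    for node in nodes:
        if node in failed_nodes:
            continue  # simulated failure: there is no state to recover, just move on
        return node, map_task(split)
    raise RuntimeError("no live node available")

node, counts = run_with_retry("a b a", ["node1", "node2"], failed_nodes={"node1"})
print(node, dict(counts))
```

Because the task keeps no state between runs, the result is the same whichever node ends up executing it.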

I would say that the statefulness of nodes in computing infrastructures has a slightly different meaning from the one you have defined. Remember that there is always a coordination process running somewhere, so there is no complete independence between the nodes.
What it can actually mean in computing infrastructures is that the nodes do not store anything about the computation they are performing on persistent storage. Consider the following: you have a master running on some machine, delegating tasks to the workers; the workers keep their information in RAM and retrieve it from RAM when necessary for the task computation. Workers also write results into RAM. You can consider the worker nodes stateless, since whenever a worker node fails (from a power cut, for example) it has no mechanism that would allow it to resume execution from the point where it stopped. The master will still know that the node has failed and will delegate the task to another machine in the cluster.
Regarding Hadoop, the architecture is stateful. First of all, whenever a job starts executing, all of its metadata is transferred to the worker node (the jar file, split location, etc.). Secondly, when a task is scheduled on a node that does not contain the input data, the data is transferred there. Additionally, the intermediate data is stored on disk, precisely for failure recovery, so that the recovery mechanisms can resume the job from the point where execution stopped.

Related

Scheduled tasks in cluster using zookeeper

We use Spring to run scheduled tasks, which works fine on a single node. We want to run these scheduled tasks in a cluster of N nodes such that each task is executed by at most one node at any point in time. This is for an enterprise use case and we may expect up to 10 to 20 nodes.
I looked into various options:
1. Use Quartz, which seems to be a popular choice for running scheduled tasks in a cluster. Drawback: database dependency, which I want to avoid.
2. Use ZooKeeper and always run the scheduled tasks only on the leader/master node. Drawback: task execution load is not distributed.
3. Use ZooKeeper and have the scheduled tasks invoked on all nodes, but before the task runs, acquire a distributed lock, and release it once execution is complete. Drawback: the system clocks on all nodes should be in sync, which may be an issue if the application is overloaded, causing system clock drift.
4. Use ZooKeeper and let the master node keep producing tasks as per the schedule and assign each one to a random worker. A new task is not assigned if the previously scheduled task has not been worked on yet. Drawback: this appears to add too much complexity.
I am inclined towards #3, which appears to be a safe solution assuming the ZooKeeper ensemble nodes run on a separate cluster with system clocks kept in sync using NTP. This also assumes that if the system clocks are in sync, then all nodes have an equal chance of acquiring the lock to execute a task.
EDIT: After some more thought I realize this may not be a safe solution either, since the system clocks must be in sync between the nodes where the scheduled tasks are running, not just between the ZooKeeper cluster nodes. I say not safe because the nodes where the tasks are running can be overloaded, with GC pauses and other causes, and there is a possibility of clocks going out of sync. But again, I would think this is a standard problem with distributed systems.
Could you please advise whether my understanding of each of the options is accurate? Or maybe there is a better approach than the listed options to solve this problem.
Well, you can improve #3 like this.
ZooKeeper provides watchers. That is, you can set a watcher on a given ZNode (say at path /some/path). All the nodes in your cluster watch the same ZNode. Whenever a node thinks (as scheduled, or by whatever means) that it should now run the scheduled task:
First, it creates a PERSISTENT_SEQUENTIAL child node under /some/path (which all the nodes are watching). You can set the data of that node as you wish; it may be a JSON string specifying the details of the task to be run. The new ZNode path will look like /some/path/prefix_<sequence-number>.
Then, all the nodes in the cluster are notified about the newly created child node. Each of them fetches the new ZNode's data and decodes the task.
Now, each node tries to acquire a distributed lock. Whichever node acquires it first can execute the task. Once the task has executed, that node should report that the task was executed (say, by creating a new ZNode named success under /some/path/prefix_<sequence-number>), and then release the lock.
Whenever a node is about to execute a task, before trying to acquire the distributed lock, it should check whether that ZNode already has a success child node.
This design ensures that no task is run twice, by checking for the child node named success under the ZNode that was created to announce the task.
I have used the above design for an enterprise solution. Actually for a distributed command framework ;-)
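The flow described in this answer can be sketched without a real ZooKeeper ensemble. The class `FakeZk` below is a hypothetical in-memory stand-in (it is not the ZooKeeper client API), threads play the role of cluster nodes, and the watcher notification is reduced to handing every thread the new task path. One assumption goes beyond the answer: the sketch checks for the success marker a second time after acquiring the lock, since otherwise two nodes that both passed the first check could each run the task in turn.

```python
import threading

class FakeZk:
    """In-memory stand-in for the few ZooKeeper features the design relies on."""
    def __init__(self):
        self._guard = threading.Lock()  # protects the structures below
        self._seq = 0
        self.children = {}              # task path -> set of child names
        self.locks = {}                 # task path -> per-task "distributed" lock

    def create_sequential(self, prefix):
        """Mimic a sequential znode: append a zero-padded counter to the prefix."""
        with self._guard:
            path = f"{prefix}{self._seq:010d}"
            self._seq += 1
            self.children[path] = set()
            self.locks[path] = threading.Lock()
            return path

def try_run(zk, task_path, node_id, executed):
    # Cheap check before competing for the lock.
    if "success" in zk.children[task_path]:
        return
    with zk.locks[task_path]:
        # Re-check under the lock: another node may have finished meanwhile.
        if "success" in zk.children[task_path]:
            return
        executed.append(node_id)               # "run" the task
        zk.children[task_path].add("success")  # report completion

zk = FakeZk()
task = zk.create_sequential("/some/path/prefix_")
executed = []
threads = [threading.Thread(target=try_run, args=(zk, task, n, executed))
           for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(task, "was executed by", len(executed), "node(s)")
```

With a real ensemble, the lock and the success marker would themselves be znodes, and each node would learn about new tasks through a child watch on /some/path.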
ZooKeeper and etcd aren't the best tools for this use case.
If your environment allows you to use Akka, it would be easier to use Akka Cluster plus a smallest-mailbox router, or whatever cluster router you prefer, and then push scheduled jobs to the ActorRef for the cluster. It is easier to set up, and you can set up thousands of nodes in a cluster with it (it uses a gossip-based membership protocol, in the same family as the ones Cassandra and Nomad use).
ScaleCube would also do it rather easily; it uses SWIM.

What is the principle of "code moving to data" rather than data to code?

In a recent discussion about distributed processing and streaming I came across the concept of 'code moving to data'. Can someone please help explain it? The reference for this phrase is MapReduceWay.
In terms of Hadoop, it is covered in another question, but I still could not figure out a technology-agnostic explanation of the principle.
The basic idea is easy: if code and data are on different machines, one of them must be moved to the other machine before the code can be executed on the data. If the code is smaller than the data, better to send the code to the machine holding the data than the other way around, if all the machines are equally fast and code-compatible. [Arguably you can send the source and JIT compile as needed].
In the world of Big Data, the code is almost always smaller than the data.
On many supercomputers, the data is partitioned across many nodes, and all the code for the entire application is replicated on all nodes, precisely because the entire application is small compared to even the locally stored data. Then any node can run the part of the program that applies to the data it holds. No need to send the code on demand.
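A back-of-the-envelope sketch of that trade-off. The sizes and the link speed are illustrative assumptions, not measurements, and the function names are hypothetical:

```python
def transfer_seconds(size_bytes, bandwidth_bytes_per_s):
    """Time to ship a payload over a link, ignoring latency and protocol overhead."""
    return size_bytes / bandwidth_bytes_per_s

def what_to_move(code_bytes, data_bytes):
    """Move whichever side is smaller (assumes equally fast, code-compatible machines)."""
    return "code" if code_bytes <= data_bytes else "data"

# Illustrative numbers only: a 10 MB job jar, a 1 TB data set, a 125 MB/s (1 Gbit/s) link.
CODE, DATA, LINK = 10 * 10**6, 10**12, 125 * 10**6

print(what_to_move(CODE, DATA))
print(transfer_seconds(CODE, LINK), "s to move the code")
print(transfer_seconds(DATA, LINK), "s to move the data")
```

Under these assumptions, shipping the code takes a fraction of a second while shipping the data takes more than two hours.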
I also just came across the sentence “Moving Computation is Cheaper than Moving Data” (from the Apache Hadoop documentation) and after some reading I think this refers to the principle of data locality.
Data locality is a strategy for task scheduling aimed at optimizing performance based on the observation that moving data across a network is costly, so when choosing which task to prioritize whenever a computing/data node is free, preference will be given to the task that's going to operate on the data in the free node or in its proximity.
This (from "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling", Zaharia et al., 2010) explains it clearly:
"Hadoop’s default scheduler runs jobs in FIFO order, with five priority levels. When the scheduler receives a heartbeat indicating that a map or reduce slot is free, it scans through jobs in order of priority and submit time to find one with a task of the required type. For maps, Hadoop uses a locality optimization as in Google’s MapReduce [18]: after selecting a job, the scheduler greedily picks the map task in the job with data closest to the slave (on the same node if possible, otherwise on the same rack, or finally on a remote rack)."
Note that the fact that Hadoop replicates data across nodes helps fair scheduling of tasks (the higher the replication, the higher the probability that a task has data on the next free node and hence gets picked to run next).
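The greedy choice in the quoted passage (same node, then same rack, then remote rack) can be sketched as follows. This is an illustration rather than Hadoop's scheduler code; the function names, node names, and rack layout are all hypothetical.

```python
def locality_level(free_node, replica_nodes, rack_of):
    """0 = data on the free node, 1 = data on the same rack, 2 = data on a remote rack."""
    if free_node in replica_nodes:
        return 0
    if rack_of[free_node] in {rack_of[n] for n in replica_nodes}:
        return 1
    return 2

def pick_map_task(free_node, pending, rack_of):
    """Greedily pick the pending map task whose input data is closest to the free node."""
    return min(pending, key=lambda task: locality_level(free_node, pending[task], rack_of))

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
# task -> nodes holding a replica of its input split
pending = {"t1": ["n3", "n4"], "t2": ["n2"], "t3": ["n1", "n3"]}

for node in ("n1", "n2", "n4"):
    print(node, "->", pick_map_task(node, pending, rack_of))
```

Adding replicas to a task's list can only lower its locality level for a given free node, which is the effect of replication described above.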

Apache Tez architecture Explanation

I was trying to see what makes Apache Tez with Hive so much faster than MapReduce with Hive.
I am not able to understand the DAG concept.
Does anyone have a good reference for understanding the architecture of Apache Tez?
The presentation from Hadoop Summit (slide 35) discussed how the DAG approach is optimal vs MapReduce paradigm:
http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212
Essentially it allows higher-level tools (like Hive and Pig) to define their overall processing steps (aka workflow, aka Directed Acyclic Graph) before the job begins. A DAG is a graph of all the steps needed to complete the job (Hive query, Pig job, etc.). Because all of the job's steps can be computed before execution time, the system can take advantage of caching intermediate job results "in memory". In MapReduce, by contrast, all intermediate data between MapReduce phases had to be written to HDFS (disk), adding latency.
YARN also allows container reuse for Tez tasks. E.g. each server is chopped into multiple "containers" rather than "map" or "reduce" slots. At any given point in the job execution this allows Tez to use the entire cluster for the map phases or the reduce phases as needed. In Hadoop v1, prior to YARN, the number of map slots (and reduce slots) was fixed/hard-coded at the platform level. Better utilization of all available cluster resources generally leads to faster
Apache Tez represents an alternative to the traditional MapReduce that allows for jobs to meet demands for fast response times and extreme throughput at petabyte scale.
Higher-level data processing applications like Hive and Pig need an execution framework that can express their complex query logic in an efficient manner and then execute it with high performance which is managed by Tez. Tez achieves this goal by modeling data processing not as a single job, but rather as a data flow graph.
… with vertices in the graph representing application logic and edges representing movement
of data. A rich dataflow definition API allows users to express complex query logic in an
intuitive manner and it is a natural fit for query plans produced by higher-level
declarative applications like Hive and Pig... [The] dataflow pipeline can be expressed as
a single Tez job that will run the entire computation. Expanding this logical graph into a
physical graph of tasks and executing it is taken care of by Tez.
The "Data Processing API in Apache Tez" blog post describes a simple Java API used to express a DAG of data processing. The API has three components:
• DAG: defines the overall job. The user creates a DAG object for each data processing job.
• Vertex: defines the user logic and the resources and environment needed to execute the user logic. The user creates a Vertex object for each step in the job and adds it to the DAG.
• Edge: defines the connection between producer and consumer vertices. The user creates an Edge object and connects the producer and consumer vertices using it.
Edge properties defined by Tez enable it to instantiate user tasks, configure their inputs and outputs, schedule them appropriately, and define how to route data between the tasks. Tez also allows the parallelism of each vertex execution to be defined by specifying user guidance, data size, and resources.
• Data movement: defines routing of data between tasks.
◦ One-To-One: Data from the ith producer task routes to the ith consumer task.
◦ Broadcast: Data from a producer task routes to all consumer tasks.
◦ Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards. The ith shard from all producer tasks routes to the ith consumer task.
• Scheduling: defines when a consumer task is scheduled.
◦ Sequential: Consumer task may be scheduled after a producer task completes.
◦ Concurrent: Consumer task must be co-scheduled with a producer task.
• Data source: defines the lifetime/reliability of a task output.
◦ Persisted: Output will be available after the task exits. Output may be lost later on.
◦ Persisted-Reliable: Output is reliably stored and will always be available.
◦ Ephemeral: Output is available only while the producer task is running.
Additional details on Tez architecture are presented in this Apache Tez Design Doc.
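To make the DAG / Vertex / Edge vocabulary concrete, here is a small Python model of it. This is not the Tez Java API: the `Edge` and `DAG` classes, their methods, and the vertex names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the step where the logical graph is turned into an execution order.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass(frozen=True)
class Edge:
    producer: str
    consumer: str
    movement: str = "SCATTER_GATHER"  # or "ONE_TO_ONE", "BROADCAST"

@dataclass
class DAG:
    name: str
    vertices: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_vertex(self, vertex):
        self.vertices.add(vertex)
        return self

    def add_edge(self, edge):
        if edge.producer not in self.vertices or edge.consumer not in self.vertices:
            raise ValueError("both ends of an edge must be vertices of the DAG")
        self.edges.append(edge)
        return self

    def execution_order(self):
        """Producers before consumers; graphlib raises CycleError if the graph has a cycle."""
        sorter = TopologicalSorter({v: set() for v in self.vertices})
        for e in self.edges:
            sorter.add(e.consumer, e.producer)
        return list(sorter.static_order())

dag = DAG("word-count-then-sort")
for v in ("tokenizer", "summation", "sorter"):
    dag.add_vertex(v)
dag.add_edge(Edge("tokenizer", "summation"))
dag.add_edge(Edge("summation", "sorter"))
print(dag.execution_order())
```

The whole multi-step computation is described as one graph up front, which is what lets an engine plan all the stages together instead of treating each one as a separate job.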
I am not yet using Tez but I have read about it. I think the two main reasons that make Hive run faster on Tez are:
1. Tez shares data between MapReduce jobs in memory when possible, avoiding the overhead of writing to and reading from HDFS.
2. With Tez you can run multiple map/reduce DAGs defined in Hive in one Tez session, without needing to start a new application master each time.
You can find a list of links that will help you to understand Tez better here: http://hortonworks.com/hadoop/tez/
Tez is a DAG (directed acyclic graph) architecture. A typical MapReduce job has the following steps:
1. Read data from file --> first disk access
2. Run mappers
3. Write map output --> second disk access
4. Run shuffle and sort --> read map output, third disk access
5. Write shuffle and sort output --> write sorted data for reducers, fourth disk access
6. Run reducers, which read the sorted data --> fifth disk access
7. Write reducer output --> sixth disk access
Tez works very similarly to Spark (Tez was created by Hortonworks):
1. Execute the plan, but with no need to read data from disk.
2. Once ready to do some calculations (similar to actions in Spark), get the data from disk, perform all the steps, and produce the output.
3. Only one read and one write.
Notice the efficiency introduced by not going to disk multiple times. Intermediate results are stored in memory (not written to disk). On top of that there is vectorization (processing a batch of rows instead of one row at a time). All this adds up to faster query times.
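The "one read, one write" idea can be illustrated with ordinary Python generators. This is only an analogy for the in-memory pipelining described here, not how Tez is implemented; `run_pipelined` and the file names are hypothetical.

```python
import os
import tempfile

def run_pipelined(in_path, out_path):
    """One read and one write: intermediate results stay in memory between steps."""
    with open(in_path) as src, open(out_path, "w") as dst:
        words = (w for line in src for w in line.split())  # "map"-like step
        upper = (w.upper() for w in words)                 # intermediate step, never hits disk
        for w in sorted(upper):                            # sort, then final output
            dst.write(w + "\n")

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.txt")
    dst_path = os.path.join(tmp, "out.txt")
    with open(src_path, "w") as f:
        f.write("b a\nc\n")
    run_pipelined(src_path, dst_path)
    with open(dst_path) as f:
        result = f.read().split()

print(result)
```

A MapReduce-style version of the same job would write each intermediate step to a file and read it back, which is where the extra disk accesses listed above come from.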
References http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey
https://community.hortonworks.com/questions/83394/difference-between-mr-and-tez.html
The main difference between MR and Tez is that MR writes intermediate data to local disk. In Tez, the mapper and reducer functionality can execute in a single instance in each container, using memory. Tez, moreover, performs operations much like transformations and actions in Spark.

Why are map task outputs written to the local disk and not to HDFS?

I am prepping for an exam and here is a question in the lecture notes:
Why are map task outputs written to the local disk and not to HDFS?
Here are my thoughts:
Reduce network traffic usage as the reducer may run on the same machine as the output so copying not required.
Don't need the fault tolerance of HDFS. If the job dies halfway, we can always just re-run the map task.
What are other possible reasons? Are my answers reasonable?
Your reasoning is correct. However, I would like to add a few points. What if map outputs were written to HDFS? Writing to HDFS is not like writing to local disk. It is a more involved process, with the namenode ensuring that at least dfs.replication.min copies are written to HDFS. The namenode will also run a background thread to make additional copies of under-replicated blocks. Suppose the user kills the job midway, or the job just fails. There will be lots of intermediate files sitting on HDFS for no reason, which you will have to delete manually. And if this happens too many times, your cluster's performance will degrade. HDFS is optimized for appending, not for frequent deletion. Also, if the job fails during the map phase, it performs a cleanup before exiting. If the output were on HDFS, the deletion would require the namenode to send a block deletion message to the appropriate datanodes, which causes invalidation of each block and its removal from the blocksMap. So many operations involved just for a failed job's cleanup, and for no gain!
Because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.
(from "Hadoop: The Definitive Guide", 4th edition)
There is one more point I know of regarding writing the map output to the local file system: the output of all the mappers eventually gets merged and finally becomes the input for the shuffle and sort stages that precede the reduce phase.

For a large mapreduce job, with a few lingering reducers, can this job be safely downsized?

Chris Smith answered this question and said I could post it.
If you have a 200-node mapreduce job, with just 3 running reduce jobs
left lingering, is it safe to switch off all nodes except the master
and the 3 with the running jobs?
Plus maybe a handful more in case of a bad node needing replacement?
If the answer to this question is "yes" it's strange that emr doesn't
automatically turn off most of the nodes when they're not in
use.
Lately there's been several jobs that mostly finished, but with a few
reduces lingering. I think this is costing us since the not-in-use
nodes stay up.
There are these issues I can think of:
-- when does data get copied to S3? If a node is not in use in terms
of running reduce, could it still be needed for copying to S3? In that
case, answer to my question is you're basically never safe to switch
off nodes
-- what happens if one of 3 jobs fails? Master/job coordinator should
reassign it to another node. I guess you're safe as long as it can
keep track of what boxes are up, and not wrongly assign to a box that
has been shut off.
If you have a 200-node mapreduce job, with just 3 running reduce jobs
left lingering, is it safe to switch off all nodes except the master
and the 3 with the running jobs?
Plus maybe a handful more in case of a bad node needing replacement?
If the answer to this question is "yes" it's strange that emr doesn't
automatically turn off most of the nodes when they're not in
use.
Keep in mind, EMR is a very thin layer over Hadoop. If you were doing distributed computation on Amazon's fabric, you could be a TON more efficient with something customized for its specific needs which would not really resemble Hadoop or Map/Reduce at all. If you're doing a lot of heavy work with Hadoop, you are often better off with your own cluster or at least with a dedicated cluster in the cloud (that way data is already sliced up on local disk and output need only be persisted to local disk). EMR's main virtue is that it is quick and dirty and hooks in nicely to other parts of AWS (like S3).
Lately there's been several jobs that mostly finished, but with a few
reduces lingering. I think this is costing us since the not-in-use
nodes stay up.
It most definitely is costing you, particularly in terms of runtime. I'd start by being concerned about why the completion times are so non-uniform.
There are these issues I can think of:
-- when does data get copied to S3? If a node is not in use in terms
of running reduce, could it still be needed for copying to S3? In that
case, answer to my question is you're basically never safe to switch
off nodes
If you are referring to the output of a job, if you have S3 as your output path for your job configuration, then data from a given task will be written out to S3 before the task exits.
-- what happens if one of 3 jobs fails? Master/job coordinator should
reassign it to another node. I guess you're safe as long as it can
keep track of what boxes are up, and not wrongly assign to a box that
has been shut off.
Well... it's a bit more complicated than that... When the new node is assigned the job, it has to pull the data from somewhere. That somewhere is typically the mappers who generated the data in the first place. If they aren't there anymore, the map tasks may need to be rerun (or more likely: the job will fail). Normally the replication factor on map output is 1, so this is an entirely plausible scenario. This is one of a few reasons why Hadoop jobs can have their "% complete" go backwards... mappers can even go back from 100% to <100%.
Related to this: it's conceivable, depending on the stage those reducer jobs are in, that they have yet to receive all of the map output that feeds in to them. Obviously in THAT case killing the wrong mapper is deadly.
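A small sketch of that failure mode, assuming (as stated above) that each map output exists only on the node that produced it. The function name `lost_map_outputs` and the task and node names are hypothetical:

```python
def lost_map_outputs(map_output_node, live_nodes):
    """Map tasks whose only copy of the output sat on a node that is no longer up.

    A reducer that is reassigned (or has not finished fetching) would need
    these map tasks to be rerun before it can make progress.
    """
    return sorted(task for task, node in map_output_node.items()
                  if node not in live_nodes)

# map task -> node holding its output (replication factor 1)
map_output_node = {"m1": "n1", "m2": "n2", "m3": "n2", "m4": "n3"}

# Suppose everything except n1 is switched off to save money.
print(lost_map_outputs(map_output_node, live_nodes={"n1"}))
```

If the result is non-empty and a reducer still needs that data, the "% complete" of the map phase goes backwards while those tasks are rerun.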
I think it is important to highlight the difference between taking offline TaskTracker only nodes, vs. nodes running TaskTracker + DataNode service. If you take off more than a couple of the latter, you're going to lose blocks in HDFS, which is usually not a great thing for your job (unless you really don't use HDFS for anything other than distributing your job). You can take off a couple of nodes at a time, and then run a rebalancer to "encourage" HDFS to get the replication factor of all blocks back up to 3. Of course, this triggers network traffic and disk I/O, which might slow down your remaining tasks.
tl;dr: there can be problems killing nodes. While you can be confident that a completed task, which writes its output to S3, has completely written out all of its output by the time the JobTracker is notified the task has completed, the same can't be said for map tasks, which write out to their local directory and transfer data to reducers asynchronously. Even if all the map output has been transferred to their target reducers, if your reducers fail (or if speculative execution triggers the spinning up of a task on another node), you may really need those other nodes, as Hadoop will likely turn to them for input data for a reassigned reducer.
--
Chris
P.S. This can actually be a big pain point for non-EMR Hadoop setups as well (instead of paying for nodes longer than you need them, it presents as having nodes sitting idle when you have work they could be doing, along with massive compute time loss due to node failures). As a general rule, the tricks to avoid the problem are: keep your task sizes pretty consistent and in the 1-5 minute range, enable speculative execution (really crucial in the EMR world where node performance is anything but consistent), keep replication factors up well above your expected node losses for a given job (depending on your node reliability, once you cross >400 nodes with day-long job runs, you start thinking about a replication factor of 4), and use a job scheduler that allows new jobs to kick off while old jobs are still finishing up (these days this is usually the default, but it was a totally new thing introduced ~Hadoop 0.20 IIRC). I've even heard of crazy things like using SSDs for mapout dirs (while they can wear out fast from all the writes, their failure scenarios tend to be less catastrophic for a Hadoop job).
