Given a large data file and a JAR file containing the mapper and reducer classes, I want to be able to work out how big a Hadoop cluster should be (I mean how many machines I would need in the cluster for the given job to run efficiently).
I am running the job on the given datafile(s).
Assuming your MapReduce job scales linearly, I suggest the following test to get a general idea of what you'll need. I assume you have a time in mind when you say "run efficiently"... this might be 1 minute for one person or 1 hour for another... it's up to you.
Run the job on one node on a subset of your data that fits on one node... or, preferably, on a small number of nodes. This test cluster should be representative of the type of hardware you will purchase later.
[(time job took on your test cluster) x (number of nodes in test cluster)]
x [(size of full data set) / (size of sample data set)]
/ (new time, i.e., "run efficiently")
= (number of nodes in final cluster)
Some things to note:
If you double the "time job took on test cluster", you'll need twice as many nodes.
If you halve the "new time", i.e., you want your job to run twice as fast, you'll need twice as many nodes.
The ratio of the full data set to the sample tells you how much to scale the result.
An example:
I have a job that takes 30 minutes on two nodes. I am running this job over a 4 GB sample of a 400 GB data set. I would like my job to take 12 minutes.
(30 minutes x 2 nodes) x (400 GB / 4 GB) / 12 minutes = 500 nodes
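If it helps, the same estimate can be written as a small helper function. This is just a sketch; the function name and arguments are illustrative, not anything Hadoop provides:

def estimate_nodes(test_minutes, test_nodes, full_size_gb, sample_size_gb, target_minutes):
    """Linear-scaling estimate of final cluster size (an approximation only)."""
    node_minutes = test_minutes * test_nodes      # work done in the test run
    scale = full_size_gb / sample_size_gb         # how much bigger the full data set is
    return node_minutes * scale / target_minutes

# The worked example above: 30 minutes on 2 nodes, a 4 GB sample of 400 GB, 12-minute target.
print(estimate_nodes(30, 2, 400, 4, 12))  # -> 500.0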
This is imperfect in a number of ways:
With one or two nodes, I'm not fully taking into account how long it'll take to transfer data over the network, which is a major part of a MapReduce job. So, you can assume it'll take longer than this estimate. If you can, test your job over 4-10 nodes and scale from there.
Hadoop doesn't "scale down" well. There is a certain speed limit that you won't be able to cross with MapReduce. Somewhere around 2-3 minutes on most clusters I've seen. That is, you won't be making a MapReduce job run in 3 seconds by having a million nodes.
Your job might not scale linearly, in which case this exercise is flawed.
Maybe you can't find representative hardware, in which case you'll have to factor in how much faster you think your new system will be.
In summary, there is no super accurate way of doing what you say. The best you can really do right now is experimentation and extrapolation. The more nodes you can do a test on, the better, as the extrapolation part will be more accurate.
In my experience, when testing from something like 200 nodes to 800 nodes, this metric is pretty accurate. I'd be nervous about going from 1 node or 2 nodes to 800. But 20 nodes to 800 might be OK.
There are distributed computation nodes, and there is a set of computation tasks represented by rows in a database table (one row per task):
A node has no information about other nodes: it can't talk to other nodes and doesn't even know how many other nodes there are
Nodes can be added and removed, nodes may die and be restarted
A node is connected only to the database
There is no limit on the number of tasks per node
The task pool is not finite; new tasks always arrive
A node takes a task by marking its row with a timestamp, so that other nodes don't consider it until some timeout has passed after that timestamp (in case the node dies and the task is not done)
The goal is to distribute tasks evenly among nodes. To achieve that I need to define some common task-acquisition algorithm: when a node starts, how many tasks should it take?
If a node takes all available tasks, then one node is always busy while the others are idle, so that's not an option.
A reasonable approach would be for each node to take tasks one by one with some delay: each node periodically checks whether there are free tasks and takes only one. In this way, shortly after start-up, all tasks end up more or less evenly distributed across the nodes. However, the drawback is that, because of the delay, it takes some time before the last task is picked up (say there are 10000 tasks, 10 nodes, and a 1-second delay: it would take 10000 tasks * 1 second / 10 nodes = 1000 seconds from start until all tasks are taken). Also, the distribution is non-deterministic, so skew is possible.
Question: what kind/class of algorithms solves such a problem, allowing tasks to be distributed quickly and evenly using some synchronization point (the database in this case), without electing a leader?
For example: nodes use some table to announce what tasks they want to take, then after some coordination steps they achieve consensus and start processing, etc.
So this comes down to a few factors to consider.
How many tasks are currently available overall?
How many tasks are currently accepted overall?
How many tasks has the node accepted in the last X minutes?
How many tasks has the node completed in the last X minutes?
Can the row fields be modified (i.e., can a field be added)?
Can a node request more tasks after it has finished its current tasks, or must all tasks be distributed immediately?
My inclination is to do the following:
If practical, add a "node identifier" field (a UUID) to the table with the task rows. When a node starts, it generates a UUID node identifier. When it accepts a task, it adds a timestamp and its UUID to the row. This easily lets other nodes determine how many "active" nodes there are.
To determine its allocation, the node determines how many tasks are available/accepted. It then notes how many unique node identifiers (including itself) have accepted tasks. It then uses this formula to decide how many more tasks to accept (ideally chosen at random to minimize the chance of competition with other nodes): 2 * available_tasks / active_nodes - nodes_accepted_tasks. So if there are 100 available tasks, 10 active nodes, and this node has accepted 5 tasks already, then it would accept: 2 * 100 / 10 - 5 = 15 tasks. If nodes only look for more tasks when they no longer have any, then you can just use available_tasks / active_nodes. (A rough code sketch of this scheme is included at the end of this answer.)
To avoid issues, there should be a max number of tasks that a node will accept at once.
If a node identifier is impractical, I would just say that each node should aim to take ceil(sqrt(N)) random tasks, where N is the number of available tasks. If there are 100 tasks, the first node will take 10, the second will take 10, the 3rd will take 9, the 4th will take 9, the 5th will take 8, and so on. This won't evenly distribute all the tasks at once, but it will ensure the nodes get a roughly even number of tasks. The slight staggering of the number of tasks means that the nodes will not all finish their tasks at the same time (which admittedly may or may not be desirable). By not fully distributing them (unless there are sqrt(N) nodes), it also reduces the likelihood of conflicts (especially if tasks are randomly selected) and reduces the number of "failed" tasks if a node goes down.
This of course assumes that a node can request more tasks after it has started; if not, it makes things much trickier.
As for an additional table, you could actually use one to keep track of the current status of the nodes: each node records how many tasks it has, its unique UUID, and when it last completed a task. That may cause database churn, though. I think it's probably good enough to just record which node has accepted each task along with when it accepted it. This is again more useful if nodes can request tasks in the future.
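If it helps to see the moving parts together, here is a minimal sketch of the claim-and-allocate idea above, using SQLite purely for illustration. The table name (tasks), the columns (id, done, claimed_by, claimed_at), the timeout, and the batch cap are all assumptions of the sketch, not anything dictated by the question:

import random
import sqlite3
import time
import uuid

NODE_ID = str(uuid.uuid4())   # this node's identifier (the "node identifier" field above)
CLAIM_TIMEOUT = 300           # seconds before a claim is considered stale
MAX_BATCH = 20                # cap on how many tasks a node will accept at once

def claim_tasks(conn: sqlite3.Connection) -> list:
    """Claim a batch of tasks using the 2 * available / active_nodes - accepted rule."""
    now = time.time()
    cur = conn.cursor()

    # Tasks are available if they were never claimed or their claim has timed out.
    cur.execute(
        "SELECT id FROM tasks WHERE done = 0 "
        "AND (claimed_at IS NULL OR claimed_at < ?)",
        (now - CLAIM_TIMEOUT,),
    )
    available = [row[0] for row in cur.fetchall()]

    # Active nodes = distinct identifiers on live claims, including ourselves.
    cur.execute(
        "SELECT DISTINCT claimed_by FROM tasks WHERE done = 0 AND claimed_at >= ?",
        (now - CLAIM_TIMEOUT,),
    )
    active_nodes = len({row[0] for row in cur.fetchall()} | {NODE_ID})

    # Tasks this node has already accepted and not yet finished.
    cur.execute(
        "SELECT COUNT(*) FROM tasks WHERE done = 0 AND claimed_by = ?",
        (NODE_ID,),
    )
    already_mine = cur.fetchone()[0]

    want = int(2 * len(available) / active_nodes) - already_mine
    want = max(0, min(want, MAX_BATCH, len(available)))

    claimed = []
    for task_id in random.sample(available, want):   # random selection to reduce contention
        # Conditional update: if another node claimed the task first, rowcount is 0.
        cur.execute(
            "UPDATE tasks SET claimed_by = ?, claimed_at = ? "
            "WHERE id = ? AND (claimed_at IS NULL OR claimed_at < ?)",
            (NODE_ID, now, task_id, now - CLAIM_TIMEOUT),
        )
        if cur.rowcount == 1:
            claimed.append(task_id)
    conn.commit()
    return claimed

The conditional UPDATE is what makes the claim safe without a leader: two nodes can both try to take the same row, but only one of them will see rowcount == 1.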
I am new to Hadoop and trying to understand it. I found a nice explanation
of HDFS and MapReduce with very simple examples (see below). But I cannot
google any similar simple example for YARN. Could someone please explain it
(like for a layman)?
HDFS
Think of a file that contains the phone numbers for everyone in the United
States; the people with a last name starting with A might be stored on server
1, B on server 2, and so on.
In a Hadoop world, pieces of this phonebook would be stored across the cluster,
and to reconstruct the entire phonebook, your program would need the blocks
from every server in the cluster. To achieve availability as components fail,
HDFS replicates these smaller pieces onto two additional servers by default.
(This redundancy can be increased or decreased on a per-file basis or for a
whole environment; for example, a development Hadoop cluster typically doesn’t
need any data redundancy.) This redundancy offers multiple benefits, the most
obvious being higher availability.
In addition, this redundancy allows the Hadoop cluster to break work up into
smaller chunks and run those jobs on all the servers in the cluster for better
scalability. Finally, you get the benefit of data locality, which is critical
when working with large data sets. We detail these important benefits later in
this chapter.
MapReduce
Let’s look at a simple example. Assume you have five files, and each file
contains two columns (a key and a value in Hadoop terms) that represent a city
and the corresponding temperature recorded in that city for the various
measurement days. Of course we’ve made this example very simple so it’s easy to
follow. You can imagine that a real application won’t be quite so simple, as
it’s likely to contain millions or even billions of rows, and they might not be
neatly formatted rows at all; in fact, no matter how big or small the amount of
data you need to analyze, the key principles we’re covering here remain the
same. Either way, in this example, city is the key and temperature is the
value.
Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
Out of all the data we have collected, we want to find the maximum temperature
for each city across all of the data files (note that each file might have the
same city represented multiple times). Using the MapReduce framework, we can
break this down into five map tasks, where each mapper works on one of the five
files and the mapper task goes through the data and returns the maximum
temperature for each city. For example, the results produced from one mapper
task for the data above would look like this:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
Let’s assume the other four mapper tasks (working on the other four files not
shown here) produced the following intermediate results:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which
combine the input results and output a single value for each city, producing a
final result set as follows:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
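To make the map/reduce split concrete, here is a plain-Python sketch of the same max-temperature logic; it is not Hadoop API code, just the idea in miniature:

def mapper(lines):
    """Emit (city, max temperature seen in this one file)."""
    best = {}
    for line in lines:
        city, temp = line.rsplit(",", 1)
        best[city] = max(best.get(city, int(temp)), int(temp))
    return list(best.items())

def reducer(intermediate_pairs):
    """Combine the per-file maxima into a single maximum per city."""
    final = {}
    for city, temp in intermediate_pairs:
        final[city] = max(final.get(city, temp), temp)
    return final

file1 = ["Toronto, 20", "Whitby, 25", "New York, 22", "Rome, 32",
         "Toronto, 4", "Rome, 33", "New York, 18"]
print(mapper(file1))
# [('Toronto', 20), ('Whitby', 25), ('New York', 22), ('Rome', 33)]
# Feeding all five mappers' outputs into reducer() would give
# {'Toronto': 32, 'Whitby': 27, 'New York': 33, 'Rome': 38}, the final result above.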
As an analogy, you can think of map and reduce tasks as the way a census was
conducted in Roman times, where the census bureau would dispatch its people to
each city in the empire. Each census taker in each city would be tasked to
count the number of people in that city and then return their results to the
capital city.
There, the results from each city would be reduced to a single count (sum of
all cities) to determine the overall population of the empire. This mapping of
people to cities, in parallel, and then combining the results (reducing) is
much more efficient than sending a single person to count every person in the
empire in a serial fashion.
Say you have 4 machines each with 4GB of RAM and dual core CPUs.
You can present YARN to an application that is able to distribute and parallelize workloads, such as MapReduce, and YARN will report that it is able to accept 16 GB of application workload on 8 CPU cores.
Not all nodes need to be the same; some can have GPU resources or higher memory throughput, but for any single running application you'll always be limited by the smallest node in the group... And the framework decides which node your code is deployed to based on the resources available, not you. When a NodeManager is co-located with an HDFS DataNode (they're running on the same machine), code that reads a file will, where possible, be run on the machine holding the parts of the file it needs.
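As a back-of-the-envelope sketch of that capacity view (the node specs are the ones from the example above; nothing here is real YARN API):

# Four identical nodes, each with 4 GB of RAM and 2 cores, as in the example.
nodes = [{"mem_gb": 4, "cores": 2}] * 4

cluster_mem = sum(n["mem_gb"] for n in nodes)    # 16 GB that YARN can hand out in total
cluster_cores = sum(n["cores"] for n in nodes)   # 8 cores in total

# No single allocation can be bigger than what the smallest node offers,
# which is the "limited by the smallest node" point above.
max_single_mem = min(n["mem_gb"] for n in nodes)   # 4 GB
max_single_cores = min(n["cores"] for n in nodes)  # 2 cores

print(cluster_mem, cluster_cores, max_single_mem, max_single_cores)  # 16 8 4 2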
Basically: divide your storage into small chunks (HDFS), provide a way to read those chunks back into complete files (MapReduce), and use some processing engine to distribute that work fairly, or greedily, across resource pools (YARN's Fair Scheduler or Capacity Scheduler).
Let's say I have a data set with 25 blocks, and the replication factor is 1. The mapper requires about 5 minutes to read and process a single block of the data. How can I calculate the time for one worker node? What about 15 nodes? Will the time change if we change the replication factor to 3?
I really need help.
First of all, I would advise reading some scientific papers on the issue (Google Scholar is a good starting point).
Now a bit of discussion. From my latest experiments I have concluded that processing time is very strongly related to the amount of data you want to process (which makes sense). On our cluster, it takes a Mapper around 7-8 seconds on average to read a 128 MB block. There are several factors which you need to consider in order to predict the overall execution time:
How much data the Mapper produces, which will more or less determine the time Hadoop requires for the shuffle phase
What is the Reducer doing? Does it do some iterative processing? (That might be slow!)
What is the configuration of the resources? (how many Mappers and Reducers are allowed to run on the same machine)
Finally, are there other jobs running simultaneously? (This might slow down your jobs significantly, since your Reducer slots can be occupied waiting for data instead of doing useful work.)
So already for one machine you can see the complexity of the task of predicting job execution time. Basically, during my study I was able to conclude that on average one machine is capable of processing 20-50 MB/second (the rate is calculated with the following formula: total input size / total job running time). The processing rate includes the staging time (when your application is starting and uploading required files to the cluster, for example). The processing rate is different for different use cases and is greatly influenced by the input size and, more importantly, by the amount of data produced by the Mappers (once again, these values are for our infrastructure; on a different machine configuration you will see completely different execution times).
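As a sketch of how such a measured rate can be turned into a first-order estimate (all the numbers in the example calls are made up; measure your own cluster):

def processing_rate(total_input_mb, total_job_seconds):
    """Rate = total input size / total job running time (staging included)."""
    return total_input_mb / total_job_seconds

def naive_time_estimate(total_input_mb, rate_mb_per_s, nodes):
    """First-order estimate assuming linear scaling, which in practice it is not."""
    return total_input_mb / (rate_mb_per_s * nodes)

# A made-up 10 GB job that ran in 5 minutes overall: rate = 10240 MB / 300 s ~ 34 MB/s.
print(round(processing_rate(10240, 300), 1))
# Naive estimate for a 100 GB input on 5 nodes at that per-node rate: ~10 minutes.
print(round(naive_time_estimate(102400, 34.1, 5) / 60, 1))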
When you start scaling your experiments, you will on average see improved performance, but once again from my study I can conclude that it is not linear: you would need to fit, for your own infrastructure, a model with the relevant variables that approximates the job execution time.
Just to give you an idea, I will share some of the results. The rate when executing a particular use case on 1 node was ~46 MB/second, for 2 nodes it was ~73 MB/second, and for 3 nodes it was ~85 MB/second (in my case the replication factor was equal to the number of nodes).
The problem is complex and requires time, patience, and some analytical skill to solve. Have fun!
I just started learning Hadoop. The official guide mentions that doubling the size of the cluster makes querying twice as much data as fast as before, whereas a traditional RDBMS would still spend twice as much time on the query.
I cannot grasp the relation between the cluster and processing data. I hope someone can give me some idea.
It's the basic idea of distributed computing.
If you have one server working on data of size X, it will spend time Y on it.
If you have 2X data, the same server will (roughly) spend 2Y time on it.
But if you have 10 servers working in parallel (in a distributed fashion) and they all have the entire data (X), then they will spend roughly Y/10 time on it. You would get the same effect by having 10 times more resources in the one server, but usually this is not feasible. (Increasing CPU power 10-fold, for example, is not very reasonable.)
This is of course a very rough simplification and Hadoop doesn't store the entire dataset on all of the servers - just the needed parts. Hadoop has a subset of the data on each server and the servers work on the data they have to produce one "answer" in the end. This requires communications and different protocols to agree on what data to share, how to share it, how to distribute it and so on - this is what Hadoop does.
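A toy version of that scaling argument, for the ideal, perfectly parallel case only (the 10.0 time units per data unit are arbitrary):

def one_server_time(data_units, time_per_unit):
    """Time for a single server: proportional to the amount of data."""
    return data_units * time_per_unit

def ideal_parallel_time(data_units, time_per_unit, servers):
    """Each server works on data_units / servers of the data, all in parallel."""
    return data_units * time_per_unit / servers

print(one_server_time(1, 10.0))              # data of size X  -> time Y (10.0)
print(one_server_time(2, 10.0))              # 2X of data      -> roughly 2Y (20.0)
print(ideal_parallel_time(1, 10.0, 10))      # X, 10 servers   -> roughly Y/10 (1.0)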
I parallelized a simulation engine into 12 threads to run it on a cluster of 12 nodes (each node running one thread). Since 12 systems are not always available, I also tweaked it for 6 threads (to run on 6 nodes), 4 threads (4 nodes), 3 threads (3 nodes), and 2 threads (2 nodes). I have noticed that the more nodes/threads I use, the greater the speedup. But obviously, the more nodes I use, the more expensive (in terms of cost and power) the execution becomes.
I want to publish these results in a journal so I want to know if there are any laws/theorems which will help me to decide the optimum number of nodes on which I should run this program?
Thanks,
Akshey
How have you parallelised your program and what is inside each of your nodes?
For instance, on one of my clusters I have several hundred nodes each containing 4 dual-core Xeons. If I were to run an OpenMP program on this cluster I would place a single execution on one node and start up no more than 8 threads, one for each processor core. My clusters are managed by Grid Engine and used for batch jobs, so there is no contention while a job is running. In general there is no point in asking for more than one node on which to run an OpenMP job since the shared-memory approach doesn't work on distributed-memory hardware. And there's not much to be gained by asking for fewer than 8 threads on an 8-core node, I have enough hardware available not to have to share it.
If you have used a distributed-memory programming approach, such as MPI, then you are probably working with a number of processes (rather than threads) and may well be executing these processes on cores on different nodes, and be paying the costs in terms of communications traffic.
As #Blank has already pointed out the most efficient way to run a program, if by efficiency one means 'minimising total cpu-hours', is to run the program on 1 core. Only. However, for jobs of mine which can take, say, a week on 256 cores, waiting 128 weeks for one core to finish its work is not appealing.
If you are not already familiar with the following terms, Google around for them or head for Wikipedia:
Amdahl's Law
Gustafson's Law
weak scaling
strong scaling
parallel speedup
parallel efficiency
scalability.
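Since Amdahl's Law is first in that list, here is a quick illustration of it (the 5% serial fraction is an arbitrary example):

def amdahl_speedup(serial_fraction, processors):
    """Amdahl's Law: S(P) = 1 / (s + (1 - s) / P), where s is the serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

for p in (1, 2, 4, 8, 16, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))
# With a 5% serial fraction the speedup can never exceed 1 / 0.05 = 20x,
# no matter how many processors you add.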
"if there are any laws/theorems which will help me to decide the optimum number of nodes on which I should run this program?"
There are no such general laws, because every problem has slightly different characteristics.
You can make a mathematical model of the performance of your problem on different numbers of nodes, knowing how much computational work has to be done and how much communication has to be done, and how long each takes. (The communication time can be estimated from the amount of communication and typical latency/bandwidth numbers for your nodes' type of interconnect.) This can guide you towards good choices.
These models can be valuable for understanding what is going on, but to actually determine the right number of nodes to run on for your code for some given problem size, there's really no substitute for running a scaling test - running the problem on various numbers of nodes and actually seeing how it performs. The numbers you want to see are:
Time to completion as a function of number of processors: T(P)
Speedup as a function of number of processors: S(P) = T(1)/T(P)
Parallel efficiency: E(P) = S(P)/P
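A small sketch of computing these from a scaling test; the timing numbers here are made-up placeholders, not measurements:

# Measured wall-clock times T(P) for different processor counts (placeholder values).
timings = {1: 1000.0, 2: 520.0, 4: 270.0, 8: 150.0, 16: 90.0, 32: 72.0}

t1 = timings[1]
for p in sorted(timings):
    speedup = t1 / timings[p]      # S(P) = T(1) / T(P)
    efficiency = speedup / p       # E(P) = S(P) / P
    print(f"P={p:3d}  T(P)={timings[p]:7.1f}s  S(P)={speedup:5.2f}  E(P)={efficiency:4.2f}")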
How do you choose the "right" number of nodes? It depends on how many jobs you have to run, and what's an acceptable use of computational resources.
So for instance, in plotting your timing results you might find that you have a minimum time to completion T(P) at some number of processors -- say, 32. So that might seem like the "best" choice. But when you look at the efficiency numbers, it might become clear that the efficiency started dropping precipitously long before that, and you only got (say) a 20% decrease in run time over running at 16 processors - that is, for 2x the amount of computational resources, you only got a 1.25x increase in speed. That's usually going to be a bad trade, and you'd prefer to run on fewer processors - particularly if you have a lot of these simulations to run. (If you have 2 simulations to run, for instance, you could get them done in 1.25 time units instead of 2 time units by running the two simulations each on 16 processors simultaneously, rather than running them one at a time on 32 processors.)
On the other hand, sometimes you only have a couple runs to do and time really is of the essence, even if you're using resources somewhat inefficiently. Financial modelling can be like this -- they need the predictions for tomorrow's markets now, and they have the money to throw at computational resources even if they're not used 100% efficiently.
Some of these concepts are discussed in the "Introduction to Parallel Performance" section of any parallel programming tutorials; here's our example, https://support.scinet.utoronto.ca/wiki/index.php/Introduction_To_Performance
Increasing the number of nodes leads to diminishing returns: two nodes are not twice as fast as one node, and four nodes even less so relative to two. As such, by that measure the optimal number of nodes is always one; it is with a single node that you get the most work done per node.