What is YARN (Hadoop) - very simple example

I am new to Hadoop and trying to understand it. I found a nice explanation
of HDFS and MapReduce with very simple examples (see below). But I cannot
google any similar simple example for YARN. Could someone please explain it
(like for a layman)?
HDFS
Think of a file that contains the phone numbers for everyone in the United
States; the people with a last name starting with A might be stored on server
1, B on server 2, and so on.
In a Hadoop world, pieces of this phonebook would be stored across the cluster,
and to reconstruct the entire phonebook, your program would need the blocks
from every server in the cluster. To achieve availability as components fail,
HDFS replicates these smaller pieces onto two additional servers by default.
(This redundancy can be increased or decreased on a per-file basis or for a
whole environment; for example, a development Hadoop cluster typically doesn’t
need any data redundancy.) This redundancy offers multiple benefits, the most
obvious being higher availability.
In addition, this redundancy allows the Hadoop cluster to break work up into
smaller chunks and run those jobs on all the servers in the cluster for better
scalability. Finally, you get the benefit of data locality, which is critical
when working with large data sets. We detail these important benefits later in
this chapter.
MapReduce
Let’s look at a simple example. Assume you have five files, and each file
contains two columns (a key and a value in Hadoop terms) that represent a city
and the corresponding temperature recorded in that city for the various
measurement days. Of course we’ve made this example very simple so it’s easy to
follow. You can imagine that a real application won’t be quite so simple, as
it’s likely to contain millions or even billions of rows, and they might not be
neatly formatted rows at all; in fact, no matter how big or small the amount of
data you need to analyze, the key principles we’re covering here remain the
same. Either way, in this example, city is the key and temperature is the
value.
Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
Out of all the data we have collected, we want to find the maximum temperature
for each city across all of the data files (note that each file might have the
same city represented multiple times). Using the MapReduce framework, we can
break this down into five map tasks, where each mapper works on one of the five
files and the mapper task goes through the data and returns the maximum
temperature for each city. For example, the results produced from one mapper
task for the data above would look like this:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
Let’s assume the other four mapper tasks (working on the other four files not
shown here) produced the following intermediate results:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which
combine the input results and output a single value for each city, producing a
final result set as follows:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
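To make this concrete, here is a minimal sketch of what the mapper and reducer for this max-temperature example might look like with Hadoop's Java MapReduce API. The class names are my own illustration, not part of the quoted text; this version has the mapper emit every (city, temperature) pair and lets the reducer (or an identical combiner, which is how the per-file maximums described above would be produced) take the maximum.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    // Mapper: reads lines like "Toronto, 20" and emits (city, temperature).
    public static class CityMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            if (parts.length == 2) {
                String city = parts[0].trim();
                int temperature = Integer.parseInt(parts[1].trim());
                context.write(new Text(city), new IntWritable(temperature));
            }
        }
    }

    // Reducer: receives (city, [all temperatures]) and keeps the maximum,
    // producing the final (Toronto, 32) (Whitby, 27) ... result set.
    public static class MaxReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text city, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
            }
            context.write(city, new IntWritable(max));
        }
    }
}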
As an analogy, you can think of map and reduce tasks as the way a census was
conducted in Roman times, where the census bureau would dispatch its people to
each city in the empire. Each census taker in each city would be tasked to
count the number of people in that city and then return their results to the
capital city.
There, the results from each city would be reduced to a single count (sum of
all cities) to determine the overall population of the empire. This mapping of
people to cities, in parallel, and then combining the results (reducing) is
much more efficient than sending a single person to count every person in the
empire in a serial fashion.

Say you have 4 machines each with 4GB of RAM and dual core CPUs.
YARN can pool those machines and present them to any application able to distribute and parallelize its workload, such as MapReduce, reporting that it can accept 16GB of application workload spread across 8 CPU cores.
Not all nodes need to be the same; some can offer GPU resources or higher memory throughput. However, any single running application is always limited by the smallest node in the group. The framework, not you, decides which node your code is deployed to, based on the resources available. When a NodeManager is co-located with an HDFS DataNode (they're running on the same machine), YARN will try to run the code that reads a file on the machine that already holds parts of the files you need.
Basically: divide your storage into small chunks (HDFS), provide a way to read those chunks back as complete files and process them (MapReduce), and use a scheduler to distribute that work fairly or greedily across resource pools (YARN's Fair Scheduler or Capacity Scheduler).
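As a rough, hypothetical illustration of where those totals come from: each NodeManager advertises how much of its machine YARN may allocate via yarn-site.xml. The values below are assumed for one of the 4GB, dual-core machines above; with four such nodes registered, the ResourceManager's pool adds up to the 16GB / 8 cores figure.

<!-- yarn-site.xml on each worker (values assumed for a 4GB, dual-core node) -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>   <!-- memory this NodeManager offers to YARN -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>      <!-- CPU cores this NodeManager offers -->
  </property>
</configuration>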

Related

Service architecture using technologies which provide parallelism and high scalability

I'm working on a booking system with a single RDBMS. This system has units (products) with several characteristics (attributes) like: location, size [m2], has sea view, has air conditioner…
On top of that there is pricing, with prices for different periods, e.g. 1/1/2018 – 1/4/2018 -> 30$ ... There is also capacity with its own periods, e.g. 1/8/2017 – 1/6/2018, and availability, which works the same way as capacity.
Each price can have its own type: per person, per stay, per item… There are restrictions for different age groups, extra beds, …
We are talking about 100k potential units. The end user can make a request to search all units in several countries, for two adults and children of 3 and 7 years, for the period 1/1/2018 – 1/8/2018, where there are 2 rooms with one king size bed and one single bed plus one extra bed. There can also be other rules, which are handled by a rule engine.
In the classical approach, filtering would be done in several iterations, trying to eliminate as much as possible in each iteration. Several tables with intermediate results could be built, but they must be re-synchronized every time something is changed through administration.
Recently I was reading about Hadoop and Storm, which are highly scalable and provide parallelism, and I was wondering whether this kind of technology is suitable for solving the described problem. The main idea is to write "one method" which validates each unit, checking whether it satisfies the given search filter. Later this function would be easy to extend with additional logic. Each cluster could take its own portion of the load: if there are 10 clusters, each of them could process 10k units.
In the Cloudera tutorial there is a step where content from the RDBMS is transferred to HDFS with Sqoop. This process takes some time, so it does not seem to be a good approach to this problem. The given problem is highly deterministic; it requires immediate synchronization and has to operate on fresh data. Maybe use some streaming service and write to HDFS and the RDBMS in parallel? Do you recommend some other technology, like Storm?
What could be a possible architecture, or a starting point, that satisfies all the requirements of this problem?
Please point me in the right direction if this question is not appropriate for this site.

Why does an increased number of clusters speed up queries in Hadoop's MapReduce?

I just started learning Hadoop. In the official guide, it is mentioned that doubling the number of clusters makes querying double the amount of data as fast as the original query.
On the other hand, a traditional RDBMS would still spend twice as much time producing the result.
I cannot grasp the relation between the cluster and processing the data. I hope someone can give me some idea.
It's the basic idea of distributed computing.
If you have one server working on data of size X, it will spend time Y on it.
If you have 2X data, the same server will (roughly) spend 2Y time on it.
But if you have 10 servers working in parallel (in a distributed fashion) on that data (X), they will spend roughly Y/10 time on it. You would gain the same effect by having 10 times more resources in the one server, but usually that is not feasible (increasing CPU power 10-fold is rarely practical).
This is of course a very rough simplification, and Hadoop doesn't store the entire dataset on all of the servers - just the needed parts. Each server holds a subset of the data and works on the data it has, producing one "answer" in the end. This requires communication and protocols to agree on what data to share, how to share it, how to distribute it and so on - this is what Hadoop does.
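As a back-of-the-envelope illustration of that linear scaling, here is a tiny sketch; the data size and per-server throughput below are invented purely for the example, and real jobs pay coordination and shuffle overhead on top.

// Naive linear-scaling estimate: time = data / (servers * per-server throughput).
public class ScalingEstimate {
    static double hours(double dataTb, int servers, double tbPerHourPerServer) {
        return dataTb / (servers * tbPerHourPerServer);
    }

    public static void main(String[] args) {
        double x = 10.0;   // "X" TB of data
        double rate = 1.0; // assumed TB/hour one server can scan
        System.out.println(hours(x, 1, rate));      // Y      = 10.0 hours
        System.out.println(hours(2 * x, 1, rate));  // 2Y     = 20.0 hours
        System.out.println(hours(x, 10, rate));     // Y / 10 =  1.0 hour
    }
}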

Extract properties of a hadoop job

Given a large data file and a jar file containing the mapper and reducer classes, I want to be able to know how big a Hadoop cluster should be (I mean, how many machines I would need in the cluster for the given job to run efficiently).
I am running the job on the given data file(s).
Assuming your MapReduce job scales linearly, I suggest the following test to get a general idea of what you'll need. I assume you have a time in mind when you say "run efficiently"... this might be 1 minute for one person or 1 hour for another... it's up to you.
Run the job on one node, on a subset of your data that fits on one node... or, preferably, on a small number of nodes. This test cluster should be representative of the type of hardware you will purchase later.
[(time job took on your test cluster) x (number of nodes in test cluster)]
x [(size of full data set) / (size of sample data set)]
/ (new time, i.e., "run efficiently")
= (number of nodes in final cluster)
Some things to note:
If you double the "time job took on test cluster", you'll need twice as many nodes.
If you halve the "new time", i.e., you want your job to run twice as fast, you'll need twice as many nodes.
The ratio of the sample tells you how much to scale the result
An example:
I have a job that takes 30 minutes on two nodes. I am running this job over a 4GB sample of a 400GB data set. I would like my job to take 12 minutes.
(30 minutes x 2 nodes) x (400 GB / 4 GB) / 12 minutes = 500 nodes
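The same arithmetic as a small, hypothetical helper (the method and variable names are mine, not from the answer), useful if you want to plug in your own measurements:

// Linear-scaling cluster-size estimate from the formula above:
// nodes = (testTime * testNodes) * (fullSize / sampleSize) / targetTime
public class ClusterSizeEstimate {
    static double estimateNodes(double testTimeMinutes, int testNodes,
                                double fullSizeGb, double sampleSizeGb,
                                double targetTimeMinutes) {
        return (testTimeMinutes * testNodes)
                * (fullSizeGb / sampleSizeGb)
                / targetTimeMinutes;
    }

    public static void main(String[] args) {
        // The worked example: 30 min on 2 nodes, 4GB sample of 400GB, 12 min target.
        System.out.println(estimateNodes(30, 2, 400, 4, 12)); // 500.0
    }
}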
This is imperfect in a number of ways:
With one or two nodes, I'm not fully taking into account how long it'll take to transfer data over the network... a major part of a MapReduce job. So, you can assume it'll take longer than this estimate. If you can, test your job over 4-10 nodes and scale from there.
Hadoop doesn't "scale down" well. There is a certain speed limit that you won't be able to cross with MapReduce. Somewhere around 2-3 minutes on most clusters I've seen. That is, you won't be making a MapReduce job run in 3 seconds by having a million nodes.
Your job might not scale linearly, in which case this exercise is flawed.
Maybe you can't find representative hardware. In which case, you'll have to factor in how much faster you think your new system will be.
In summary, there is no super accurate way of doing what you say. The best you can really do right now is experimentation and extrapolation. The more nodes you can do a test on, the better, as the extrapolation part will be more accurate.
In my experience, when testing from something like 200 nodes to 800 nodes, this metric is pretty accurate. I'd be nervous about going from 1 node or 2 nodes to 800. But 20 nodes to 800 might be OK.

Duplicate Key Filtering

I am looking for a distributed solution to screen/filter a large volume of keys in real-time. My application generates over 100 billion records per day, and I need a way to filter duplicates out of the stream. I am looking for a system to store a rolling 10 days’ worth of keys, at approximately 100 bytes per key. I was wondering how this type of large scale problem has been solved before using Hadoop. Would HBase be the correct solution to use? Has anyone ever tried a partially in-memory solution like Zookeeper?
I can see a number of solutions to your problem, but the real-time requirement really narrows it down. By real-time, do you mean you want to see whether a key is a duplicate as it's being created?
Let's talk about queries per second. You say 100B/day (that's a lot, congratulations!). That's 1.15 Million queries per second (100,000,000,000 / 24 / 60 / 60). I'm not sure if HBase can handle that. You may want to think about something like Redis (sharded perhaps) or Membase/memcached or something of that sort.
If you were to do it in HBase, I'd simply push the upwards of a trillion keys (10 days x 100B keys) into the table as row keys, and put some value in there with them (because you have to). Then, you can just do a get to figure out whether the key is in there. This is kind of hokey and doesn't fully utilize HBase, as it only uses the keyspace. Effectively, HBase becomes a B-tree service in this case. I don't think this is a good idea.
If you relax the constraint so that it doesn't have to be real-time, you could use MapReduce in batch to dedup. That's pretty easy: it's just Word Count without the counting. You group by the key you have, and you'll see the dups in the reducer when multiple values come back. With enough nodes and enough latency, you can solve this problem efficiently. Here is some example code for this from the MapReduce Design Patterns book: https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch3/DistinctUserDriver.java
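For reference, here is a stripped-down sketch of that pattern (my own simplification, not the linked book code): the mapper emits each record key with a null value, duplicates collapse in the shuffle, and the reducer writes each distinct key exactly once.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctKeys {

    // Map: emit each record key with a NullWritable value.
    public static class DedupMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            context.write(record, NullWritable.get());
        }
    }

    // Reduce: each distinct key arrives as one group; write it out a single time.
    public static class DedupReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}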
ZooKeeper is for distributed process communication and synchronization. You don't want to be storing trillions of records in ZooKeeper.
So, in my opinion, you're better served by an in-memory key/value store such as Redis, but you'll be hard pressed to store that much data in memory.
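If you go the Redis route, one hedged sketch (using the Jedis client; exact method signatures vary between Jedis versions) is to SET each key with NX and a 10-day TTL, so the rolling window expires itself and the reply tells you whether the key was already present.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Hypothetical duplicate check against Redis: SET key NX EX <10 days>.
// SET ... NX returns OK only when the key did not already exist, so a null
// reply means the key is a duplicate within the rolling window.
public class DuplicateFilter {
    private static final int TEN_DAYS_SECONDS = 10 * 24 * 60 * 60;

    private final Jedis jedis;

    public DuplicateFilter(Jedis jedis) {
        this.jedis = jedis;
    }

    /** Returns true if the key has already been seen in the last 10 days. */
    public boolean isDuplicate(String key) {
        String reply = jedis.set(key, "1", SetParams.setParams().nx().ex(TEN_DAYS_SECONDS));
        return reply == null; // null => NX prevented the write => already present
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            DuplicateFilter filter = new DuplicateFilter(jedis);
            System.out.println(filter.isDuplicate("record-123")); // false the first time
            System.out.println(filter.isDuplicate("record-123")); // true the second time
        }
    }
}

Whether a sharded Redis deployment can actually hold this much data in memory is exactly the concern the next answer raises.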
I am afraid that this is impossible with traditional systems.
Here is what you have mentioned:
100 billion per day means approximately 1 million per second.
The size of each key is 100 bytes.
You want to check for duplicates in a 10-day working set, which means 1 trillion items.
These assumptions result in lookups against a set of 1 trillion objects with a total size of about 90 terabytes.
Any solution to this real-time problem must provide a system that can look up 1 million items per second in this volume of data.
I have some experience with HBase, Cassandra, Redis, and Memcached. I am sure that you cannot achieve this performance on any disk-based storage like HBase, Cassandra, or HyperTable (and add any RDBMS like MySQL, PostgreSQL, and so on to that list). The best performance of Redis and Memcached that I have heard of in practice is around 100k operations per second on a single machine. This means you would need about 90 machines, each with 1 terabyte of RAM.
Even a batch processing system like Hadoop cannot do this job in less than an hour, and I guess it would take hours or days even on a big cluster of 100 machines.
You are talking about very, very big numbers (90 TB, 1M per second). Are you sure about this?

Estimating computation costs for parallel computing

I am very new to the parallel computing world. My group uses Amazon EC2 and S3 to manage all the data, and it has really opened a new world to me.
My question is how to estimate the costs of a computation. Suppose I have n TB of data in k files on Amazon S3 (for example, 0.5 TB of data in 7000 zip files). I would like to loop through all the files and perform one regex-matching operation, using Pig Latin, on each line of the files.
I am very interested in estimating these costs:
1. How many instances should I select to perform this task? What should the capacity of the instances be (the size of the master instance and of the map-reduce instances)? Can I deduce these capacities and costs from n and k, as well as from the cost of each operation?
2. I have designed an example data flow: I used one xlarge instance as my master node and 10 medium instances as my map-reduce group. Would this be enough?
3. How can I maximize the bandwidth for each of these instances to fetch data from S3? From my designed data flow, it looks like the reading speed from S3 is about 250,000,000 bytes per minute. How much data exactly is transported to the EC2 instance? Would this be the bottleneck of my job flow?
1- IMHO, it depends solely on your needs. You need to choose based on the intensity of the computation you are going to perform on your dataset, and you can obviously cut down the cost accordingly.
2- For how much data? What kind of operations? What latency/throughput? For POCs and small projects it seems good enough.
3- It actually depends on several things, such as whether you're in the same region as your S3 endpoint, the particular S3 node you're hitting at a point in time, etc. You might be better off using EBS if you need quicker data access, IMHO: you could mount an EBS volume on your EC2 instance and keep the data you frequently need there. Otherwise, some straightforward options are 10 Gigabit connections between servers or perhaps dedicated (costly) instances. But nobody can guarantee whether data transfer will be a bottleneck or not; sometimes it may be.
I don't know if this answers your cost questions completely, but the AWS Monthly Calculator certainly would.
