Hadoop behind the scenes

Can someone explain what Hadoop is in terms of the ideas behind the software? What makes it so popular and/or powerful?

Hadoop is a programming environment that enables running massive computations in parallel on a large cluster of machines. It is resilient to the loss of several machines, it scales so that computations can be sped up by adding machines, and it is trackable, reporting the status of a running computation.
Hadoop is popular because it is a strong open source environment and because many users, including large ones such as Yahoo!, Microsoft and Facebook, employ it for large data-crunching projects.
It is powerful because it uses the map/reduce algorithm, which decomposes a computation into a sequence of two simple operations:
map - Take a list of items and perform the same simple operation on each of them. For example, take the text of a web page, tokenize it and replace every token with the pair token:1.
reduce - Take a list of items and accumulate it using an accumulation operator. For example, take the list of token:1 pairs, count the occurrences of each token and output a list of the form token:nt, where nt is the number of times that token appeared in the original list.
Using proper decomposition (which the programmer does) and task distribution and monitoring (which Hadoop does), you get a fast, scalable computation; in our example, a word-counting computation. You can sequence tens of maps and reduces and get implementations of sophisticated algorithms.
This is the very high level view. Now go read about MapReduce and Hadoop in further detail.
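For concreteness, here is a minimal sketch of that word-count decomposition written against Hadoop's Java MapReduce API. It mirrors the canonical example that ships with Hadoop; the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: for every token in the input line, emit the pair (token, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: accumulate the 1s for each token and emit (token, count)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}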

Hadoop implements Google's MapReduce algorithm; to understand it better you should read Google's MapReduce paper over at http://research.google.com/archive/mapreduce.html

Related

What is the principle of "code moving to data" rather than data to code?

In a recent discussion about distributed processing and streaming I came across the concept of 'code moving to data'. Can someone please help explain it? The reference for this phrase is MapReduceWay.
In terms of Hadoop it's stated in another question, but I still could not figure out an explanation of the principle in a tech-agnostic way.
The basic idea is easy: if code and data are on different machines, one of them must be moved to the other machine before the code can be executed on the data. If the code is smaller than the data, it is better to send the code to the machine holding the data than the other way around, provided all the machines are equally fast and code-compatible. [Arguably you can send the source and JIT-compile it as needed.]
In the world of Big Data, the code is almost always smaller than the data.
On many supercomputers, the data is partitioned across many nodes, and all the code for the entire application is replicated on all nodes, precisely because the entire application is small compared to even the locally stored data. Then any node can run the part of the program that applies to the data it holds. No need to send the code on demand.
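To put rough numbers on the "code is smaller than the data" argument, here is a back-of-envelope sketch; the jar size, data size and bandwidth are purely illustrative.

public class CodeVsData {
  public static void main(String[] args) {
    double bytesPerSecond = 125_000_000.0;                  // roughly a 1 Gbit/s link
    double jarBytes  = 50.0 * 1024 * 1024;                  // a 50 MB application jar
    double dataBytes = 5.0 * 1024 * 1024 * 1024 * 1024;     // a 5 TB data set

    System.out.printf("Shipping the code: %.1f seconds%n", jarBytes / bytesPerSecond);
    System.out.printf("Shipping the data: %.1f hours%n", dataBytes / bytesPerSecond / 3600);
  }
}

With these assumed numbers, shipping the jar takes well under a second while shipping the data takes on the order of half a day, which is the whole point of moving the computation instead.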
I also just came across the sentence “Moving Computation is Cheaper than Moving Data” (from the Apache Hadoop documentation) and after some reading I think this refers to the principle of data locality.
Data locality is a strategy for task scheduling aimed at optimizing performance. It is based on the observation that moving data across a network is costly, so when choosing which task to prioritize whenever a compute/data node is free, preference is given to the task that will operate on the data stored on that free node or close to it.
This (from Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling, Zaharia et al., 2010) explains it clearly:
Hadoop’s default scheduler runs jobs in FIFO order, with five priority levels. When the scheduler receives a heartbeat indicating that a map or reduce slot is free, it scans through jobs in order of priority and submit time to find one with a task of the required type. For maps, Hadoop uses a locality optimization as in Google’s MapReduce [18]: after selecting a job, the scheduler greedily picks the map task in the job with data closest to the slave (on the same node if possible, otherwise on the same rack, or finally on a remote rack).
Note that the fact that Hadoop replicates data across nodes improves fair scheduling of tasks (the higher the replication, the higher the probability that a task has its data on the next free node and hence gets picked to run next).
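As a rough illustration of the preference order described in the quote (this is not Hadoop's actual scheduler code; the Task type and the host-to-rack map are made up for the example), a sketch might look like this:

import java.util.List;
import java.util.Map;
import java.util.Optional;

class Task {
  final List<String> inputBlockHosts; // nodes holding replicas of this task's input split
  Task(List<String> inputBlockHosts) { this.inputBlockHosts = inputBlockHosts; }
}

class LocalityAwarePicker {
  private final Map<String, String> rackOf; // host name -> rack id, e.g. "node17" -> "rack3"

  LocalityAwarePicker(Map<String, String> rackOf) { this.rackOf = rackOf; }

  Optional<Task> pick(List<Task> pendingTasks, String freeNode) {
    // 1. node-local: a task whose input replica sits on the free node itself
    for (Task t : pendingTasks) {
      if (t.inputBlockHosts.contains(freeNode)) return Optional.of(t);
    }
    // 2. rack-local: a task whose input replica sits on another node in the same rack
    String rack = rackOf.get(freeNode);
    for (Task t : pendingTasks) {
      for (String host : t.inputBlockHosts) {
        if (rack != null && rack.equals(rackOf.get(host))) return Optional.of(t);
      }
    }
    // 3. off-rack: fall back to any remaining task
    return pendingTasks.isEmpty() ? Optional.empty() : Optional.of(pendingTasks.get(0));
  }
}

The more replicas a block has, the more likely the first or second loop finds a match, which is exactly the point made above about replication helping scheduling.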

Hadoop - CPU intensive application - Small data

Is Hadoop a proper solution for jobs that are CPU-intensive and only need to process a small file of around 500 MB? I have read that Hadoop is aimed at processing so-called Big Data, and I wonder how it performs with a small amount of data (but a CPU-intensive workload).
I would mainly like to know whether a better approach for this scenario exists, or whether I should stick to Hadoop.
Hadoop is a distributed computing framework providing a MapReduce engine. If you can express your parallelizable, CPU-intensive application in this paradigm (or in any other supported by Hadoop modules), you may take advantage of Hadoop.
A classical example of a Hadoop computation is the calculation of Pi, which doesn't need any input data. As you'll see here, Yahoo managed to determine the two quadrillionth digit of Pi thanks to Hadoop.
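To see why no input data is needed, here is a tiny single-process Monte Carlo sketch of the idea behind such a pi job. It is not the pi estimator that ships with the Hadoop examples (which uses a quasi-Monte Carlo method); it only illustrates the pattern "each map task generates its own samples, the reduce step sums the counts".

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;

public class PiSketch {
  public static void main(String[] args) {
    final int maps = 8;                  // pretend each index is one map task
    final long samplesPerMap = 1_000_000;

    long inside = LongStream.range(0, maps).parallel().map(m -> {
      long hits = 0;
      ThreadLocalRandom rnd = ThreadLocalRandom.current();
      for (long i = 0; i < samplesPerMap; i++) {
        double x = rnd.nextDouble(), y = rnd.nextDouble();
        if (x * x + y * y <= 1.0) hits++;        // point falls inside the quarter circle
      }
      return hits;                               // the "map output" of this task
    }).sum();                                    // the "reduce" step

    System.out.println("pi is approximately " + 4.0 * inside / (maps * samplesPerMap));
  }
}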
However, Hadoop is indeed specialized for Big Data in the sense that it was developed for that purpose. For instance, it comes with a file system designed to hold huge files. These huge files are chunked into many blocks spread across a large number of nodes, and to ensure data integrity each block is replicated to other nodes.
To conclude, I'd say that if you already have a Hadoop cluster at your disposal, you may want to take advantage of it.
If that's not the case, and while I can't recommend anything specific since I have no idea what exactly your need is, I think you can find frameworks more lightweight than Hadoop.
Well, a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs, possibly on many nodes. For this you should use a Scalable Language especially designed for this problem - in other words, Scala. Using Scala with Spark is much, much easier and much, much faster than Hadoop.
If you don't have access to a cluster, it can still be an idea to use Spark so that you can use it more easily in the future. Or just use .par in Scala, which will parallelize your code and use all the CPUs on your local machine.
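As a side note, if you stay on the JVM but outside Scala, the same single-machine trick can be sketched with Java's parallel streams (Java is used here only to match the other examples on this page; expensiveAnalysis and the sample records are placeholders for the real CPU-intensive work):

import java.util.List;
import java.util.stream.Collectors;

public class LocalParallel {

  // Placeholder for the real CPU-intensive per-record computation.
  static String expensiveAnalysis(String record) {
    return record.toUpperCase();
  }

  public static void main(String[] args) {
    List<String> records = List.of("a", "b", "c");          // stands in for the real ~500 MB of records
    List<String> results = records.parallelStream()          // fans the work out over all local cores
        .map(LocalParallel::expensiveAnalysis)
        .collect(Collectors.toList());
    System.out.println(results);
  }
}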
Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we face for data normalization: a need for parallel processing on cheap hardware and software, with ease of use, instead of going through all the special programming of traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations; indeed, the test application still distributed with it, WordCount, is numbingly simplistic. That is because the genesis of Hadoop was to handle the tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to try to find a more general-purpose business use case. Thus, Hadoop as described in its common form is not targeted at the use case you and we have. But Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a specially built hardware environment, PSIKLOPS, designed for small clusters (1-10 nodes) with enough power at low cost to run 4-20 parallel jobs. We will be showcasing this in a series of webcasts by Inside Analysis titled Tech Lab, in conjunction with Cloudera for the first series, coming in early Aug 2014. We see this capability as a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use for launching multiple concurrent containers of custom Java.

What are the disadvantages of MapReduce?

What are the disadvantages of MapReduce? There are lots of advantages of MapReduce, but I would like to know its disadvantages too.
I would rather ask when MapReduce is not a suitable choice. I don't think you would see any disadvantage if you are using it as intended. Having said that, there are certain cases where MapReduce is not a suitable choice:
Real-time processing.
It's not always very easy to implement each and everything as a MR program.
When your intermediate processes need to talk to each other (jobs run in isolation).
When your processing requires a lot of data to be shuffled over the network.
When you need to handle streaming data. MR is best suited to batch process huge amounts of data which you already have with you.
When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system as compared to a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short on-line transactions.
There might be several other cases. But the important thing here is how well you are using it. For example, you can't expect an MR job to give you the result in a couple of ms. You can't count that as a disadvantage either; it's just that you are using it in the wrong place. And that holds true for any technology, IMHO. Long story short, think well before you act.
If you still want, you can take the above points as the disadvantages of mapreduce :)
HTH
Here are some use cases where MapReduce does not work very well.
When you need a response fast, e.g. in less than a few seconds (use stream processing, CEP etc. instead)
Processing graphs
Complex algorithms, e.g. some machine learning algorithms like SVM; also see the 13 dwarfs (The Landscape of Parallel Computing Research: A View From Berkeley)
Iterations - when you need to process the data again and again, e.g. KMeans - use Spark instead
When the map phase generates too many keys; then sorting takes forever.
Joining two large data sets with complex conditions (the equality case can be handled via hashing etc.)
Stateful operations - e.g. evaluating a state machine
Cascading tasks one after the other - using Hive or Pig might help, but there is a lot of overhead in rereading and parsing data.
You need to rethink/rewrite trivial operations like joins and filters to fit the map/reduce key/value pattern (see the sketch below).
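To illustrate that last point, here is a sketch of a reduce-side join in plain MapReduce. The comma-separated record layouts and field positions are assumptions made up for the example; the point is how a one-line SQL join has to be re-expressed as tagged key/value pairs.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

  // Tag each user record with "U" and emit it keyed by user id.
  public static class UserMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");      // assumed layout: userId,name,...
      context.write(new Text(fields[0]), new Text("U\t" + value));
    }
  }

  // Tag each order record with "O" and emit it keyed by user id.
  public static class OrderMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");      // assumed layout: orderId,userId,amount,...
      context.write(new Text(fields[1]), new Text("O\t" + value));
    }
  }

  // For each user id, cross the buffered user records with the order records.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> users = new ArrayList<>();
      List<String> orders = new ArrayList<>();
      for (Text v : values) {
        String s = v.toString();                          // copy, since Hadoop reuses the Text object
        if (s.startsWith("U\t")) users.add(s.substring(2));
        else orders.add(s.substring(2));
      }
      for (String u : users)
        for (String o : orders)
          context.write(key, new Text(u + "\t" + o));     // one joined row per user/order pair
    }
  }
}

In a real driver the two mappers would typically be attached to their respective inputs with MultipleInputs; the boilerplate above replaces what a database expresses as a single join clause.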
MapReduce assumes that the job can be parallelized. But it may not be the case for all data processing jobs.
It is closely tied to Java; of course you have Pig and Hive to the rescue, but you lose flexibility.
First of all, it streams the map output; if it were possible to keep it in memory, this would be more efficient. I originally deployed my algorithm using MPI, but when I scaled up some nodes started swapping, which is why I made the transition.
The Namenode keeps track of the metadata of all files in your distributed file system. I am reading a Hadoop book (Hadoop in Action) and it mentions that Yahoo estimated the metadata at approximately 600 bytes per file. This implies that if you have too many files, your Namenode could run into problems; at that rate, 100 million files would already need on the order of 60 GB of Namenode memory for metadata alone.
If you do not want to use the streaming API, you have to write your program in Java. I, for example, did a translation from C++. This has some side effects; for example, Java has a large string overhead compared to C. Since my software is all about strings, this is something of a drawback.
To be honest I really had to think hard to find disadvantages. The problems mapreduce solved for me were way bigger than the problems it introduced. This list is definitely not complete, just a few first remarks. Obviously you have to keep in mind that it is geared towards Big Data, and that's where it will perform at its best. There are plenty of other distribution frameworks out there with their own characteristics.

Hadoop MapReduce vs Clojure pmap function

Suppose I have large independent sets of data in separate Excel files.
In terms of runtime efficiency, would it be better to use Clojure's pmap function to process the data, or Hadoop MapReduce?
Each Excel file has at least about 34,000 rows, and I have quite a large number of them.
Sorry for the beginner question; I am relatively new to both and am doing research on them.
As some of you have explained, perhaps one more question would be to compare Clojure's pmap against running multiple instances of the same copy of the software; what are the differences between those?
The only thing I can think of is that pmap can take any number of arguments, whereas reading one file per application instance would require the number of files to be known upfront and the instances to be initialized.
I'd say use Hadoop, but not directly, but rather through Cascalog from Clojure. The value proposition here is all that Hadoop gives you plus the great declarative query language (which may well make using Cascalog worthwhile even if the task is relatively small; setup with Hadoop in local mode is completely hassle-free).
The original introductory blog posts are still the best starting point (although there's great documentation available now -- see the wiki at GitHub): the first one is here and it links to the second one at the end.
To give you a taste of what it looks like, here's a snippet from the first tutorial (finding all "follow" relationships where the follower is older than the person they follow):
(?<- (stdout) [?person1 ?person2]
(age ?person1 ?age1)
(follows ?person1 ?person2)
(age ?person2 ?age2)
(< ?age2 ?age1))
No problem running this on a cluster too, see News Feed in 38 lines of code using Cascalog on Nathan Marz's blog for an example.
I wouldn't go running off and establishing a Hadoop cluster just to be able to process a lot of small files (which is not ideal for Hadoop anyway). Hadoop is geared towards handling large files (its default block size is 64 MB), and the MapReduce efficiency comes from having these large files distributed over the cluster and sending the computation to the data.
In your case it seems that running multiple copies of your software, each processing one file at a time, would solve the problem and would have the least overhead - both computational and operational (i.e. setting up and maintaining Hadoop).
One thing that Hadoop can give you is management of the processing tasks, that is, retries in case of failure etc., but again, it seems like overkill for what you describe.
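One simple way to realize that suggestion on a single machine is a fixed thread pool with one task per file (rather than separate OS processes); in this sketch, processFile and the directory path are placeholders for the real per-file work and location.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PerFileWorkers {

  // Placeholder for the real per-file computation (parsing, analysis, ...).
  static void processFile(Path file) {
    System.out.println(Thread.currentThread().getName() + " -> " + file);
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    try (Stream<Path> files = Files.list(Paths.get("/data/excel-exports"))) { // illustrative path
      List<Future<?>> futures = files
          .map(f -> pool.submit(() -> processFile(f)))   // one task per file
          .collect(Collectors.toList());
      for (Future<?> future : futures) {
        future.get();                                    // wait for every file to finish
      }
    }
    pool.shutdown();
  }
}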
Lots of languages have map reduce capabilities, including Clojure.
I'd say that Hadoop would be the hands-down winner because it manages the work over clusters of machines. It's the potential for massive parallelization that gives it the clear edge over anything else that doesn't have it built in.

Hadoop map-reduce vs Cascading - which is better when compared on the basis of processing time?

I have used Cascading as well as M/R, and the Cascading job looks slow compared to M/R - around 25% to 50% slower. Is that expected, or do I need to dig more into Cascading for optimization?
I can't speak to the overhead of a Cascading job compared to a hand-drawn raw MapReduce job, as it really depends on the workload complexity, the version of Cascading, how you wrote each job, the weather inside Amazon or your network, etc.
That said, Cascading is an abstraction over MapReduce and there will be some overhead. But as an abstraction, it has opportunities to do things more efficiently (1.2 will lazily deserialize data during sorting for example, something a raw MR developer would need to code manually for each intermediate object via a Comparator implementation).
My suspicion is that you are assuming Cascading makes some sort of cluster configuration optimizations over and above the defaults. It does not. So if you run a Cascading Flow without setting any different Hadoop properties, it's likely you will only see one reducer in each job as that's the default in Hadoop (see mapred-default.xml).
Or your job is simple enough that it can use 'Combiners', which Cascading does not support directly; instead it has a more flexible alternative, map-side partial aggregation. This is similar to combiners, but it trades memory for CPU, and partial aggregations are not limited to commutative-associative operations the way Combiners are. Here is a better description of partial aggregation.
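For comparison, this is roughly how a plain MapReduce job enables map-side aggregation with a Combiner, reusing the classes from the word-count sketch earlier on this page. Reusing the reducer as a combiner only works because summing is commutative and associative, which is exactly the restriction the partial aggregation described above avoids.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(CombinerExample.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);  // mapper from the earlier word-count sketch
    job.setCombinerClass(WordCount.IntSumReducer.class);  // partial sums computed on the map side
    job.setReducerClass(WordCount.IntSumReducer.class);   // final sums computed on the reduce side
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}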
I should say that if your workload is simple enough (and will stay simple) that you can write a couple of MR jobs to satisfy it (and Hadoop is really justified here), you should probably stick with that (yet see below).
But the work I do (and I'm the author of Cascading) results in dozens of, if not a hundred in some cases, MR jobs. The fact that I can actually complete my project and get results within days outweighs the minor overhead Cascading may impart in some cases. For example, Cascading has a fail-fast planner, that is, it will not run a Cascading Flow on the cluster if all the data/field dependencies are not satisfied in the Flow.
It is very unlikely you can have that feature if you are chaining raw MR jobs together. It is more likely your workload will fail hours later because of a missing dependency that can only be identified at runtime.
Or, you are passing raw typed 'business objects' around (in order to gain compiler type safety), which means you are either passing data through the cluster unnecessarily, or have dozens of intermediary objects you must manually maintain as you change the business rules of the data processing either upstream or downstream.
Another point on the number of MR jobs: the only way to decrease the cost of a workload in Hadoop is to reduce IO between the jobs in the workload. This is typically done by replacing inefficient algorithms with better ones at the cost of added complexity, adding more jobs to do things more intelligently. So if you think you only need a handful of MR jobs and you then discover a nasty bottleneck in your data when running at scale (which is what always happens to me, at least), you may need to take a different approach to the problem, one that will likely result in a couple more jobs. I know this seems counterintuitive, but it happens a lot. In such cases you will be glad you are working with an abstraction where you can keep your head in the problem domain, not the MapReduce domain.
If you really are concerned about performance, please feel free to email the Cascading mail list with your code, and I or the community would be glad to help identify any issues with it or in Cascading.

Resources