hadoop map reduce vs clojure pmap function - hadoop

Supposedly i have large independent sets of data in seperate excel files.
In terms of runtime efficiency, would it be better to use clojure Pmap function to process the data or hadoop map reduce?
Each excel file consists of about 34000 rows at least and i have quite a large number of them.
Sorry for the beginner question as i am relatively new to both and are doing research on them
As some of you guys have explained,
Perhaps one more question would be to compare clojure pmap against instance of running multi instances of the same copies of software, what are the differences between those?
The only thing i can think of is that pmap can take any amount of variables however, reading one file per instance of applications would require the number of files to be known upfront and the instances be initialized

I'd say use Hadoop, but not directly, but rather through Cascalog from Clojure. The value proposition here is all that Hadoop gives you plus the great declarative query language (which may well make using Cascalog worthwhile even if the task is relatively small; setup with Hadoop in local mode is completely hassle-free).
The original introductory blog posts are still the best starting point (although there's great documentation available now -- see the wiki at GitHub): the first one is here and it links to the second one at the end.
To give you a taste of what it looks like, here's a snippet from the first tutorial (finding all "follow" relationships where the follower is older than the person they follow):
(?<- (stdout) [?person1 ?person2]
(age ?person1 ?age1)
(follows ?person1 ?person2)
(age ?person2 ?age2)
(< ?age2 ?age1))
No problem running this on a cluster too, see News Feed in 38 lines of code using Cascalog on Nathan Marz's blog for an example.

I wouldn't go running and establishing an Hadoop cluster just to be able to process a lot of small files (which is not ideal for Hadoop anyway). Hadoop is geared towards handling large files (its block size is 64M) and the map reduce efficiency comes from letting having these large files distributed over the cluster and sending the computation to the data.
In your case it seems that running multiple copies of your software each processing one file at a time would solve the problem and would have the least overhead - both computational and operational (ie. setting up and maintaining hadoop).
One thing that hadoop can give you is the management of the processing task, that is retires in case of failure etc., but again, it seems and overkill for what you describe

Lots of languages have map reduce capabilities, including Clojure.
I'd say that Hadoop would be the hands-down winner because it manages it over clusters of machines. It's the potential for massive parallelization that would give it the clear edge over anything else that didn't have it built in.

Related

Disadvantage of using Hadoop for processing payroll

I was asked this question in an interview. The question was asked when I was explaining the disadvantages of Hadoop.
The disadvantages I told them are:
1. Single point of failure because of single master nodes.
2. Security is not at its best.
3. Suitable for processing only very large data/files.
Now, when I am learning more about the disadvantages, I am confused whether the batch processing nature of the Hadoop that makes it unsuitable for processing payroll at an organisation?
Can you please let me know whether my assumption is correct?
The answer I gave at the interview was completely different. I told them due to the distributed nature of hadoop jobs the update of a salary at one place might not get reflected in the database very quickly and the data wont be consistent across all the nodes.
I think I should have also mentioned that due to the batch processing nature, the updates wont get reflected immediately in all the nodes.
Is this final answer will be an optimal one for the question?
As far as I know, payroll is usually a batch process, howerver the question I’d rather ask is - how many employees does the company need to have in order to need hadoop to do the payroll processing.
And depending on which version of hadoop you were talking about (1.0 - pure MR based, or 2.0 with YARN):
YARN solves most of the single point of failure problems (AFAIK), on the other hand - processing payroll in a map/reduce way seems for me insane. Even more when we can assume that most of the companies (if not all of them) have this kind of data anyway stored in RDBMS.
Summarized, I’d say that MR only makes sense if the data is also stored in HDFS and there are many other and simpler ways to distribute payroll processing over multiple machines (or multiple cores - usually this should be already enough) - especially if the necessary data is stored in an RDBMS.
Update (see comment):
why is it insane to use MR for this job? - MR is best suited for counting words - this is not a joke. The rather surprising part is, how many problems you can solve by counting words. You can create inverted indices (MR was invented by Google, and this is what Google is doing, so it isn't surprising after all, that it is so good in it).
Spotify is for instance using MR probably to count which song was how often listened to. You could imagine, they have gigantic logs (in text form or in Cassandra, ...) from each user listening to a song, and they need to create a report for the music labels about this, this is where MR is at its best.
I also know, about a friend of a friend, who is working (worked) in a company specializing in taking algorithms and transferring them to be executed in Hadoop as MR. This is done because of the great infrastructure of a Hadoop cluster like the management or fault tolerance.
However with YARN now many more programming paradigms can be implemented on top of a Hadoop (or YARN) cluster, not just MR. With Apache Twill you can even deploy your own app paradigm, or just take an existing multi-threaded application modify it slightly and you can deploy it on an existing Hadoop 2.0 cluster. - With this it might even make sense to run a payroll job on a YARN cluster - provided it is necessary because you need to do it for millions of employees.

Hadoop - CPU intensive application - Small data

Is Hadoop a proper solution for jobs that are CPU intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed to process the so called Big Data, and I wonder how it performs with a small amount of data (but a CPU intensive workload).
I would mainly like to know if a better approach for this scenario exists or instead I should stick to Hadoop.
Hadoop is a distributed computing framework proposing a MapReduce engine. If you can express your parallelizable cpu intensive application with this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop.
A classical example of Hadoop computations is the calculation of Pi, which doesn't need any input data. As you'll see here, yahoo managed to determine the two quadrillonth digit of pi thanks to Hadoop.
However, Hadoop is indeed specialized for Big Data in the sense that it was developped for this aim. For instance, you dispose of a file system designed to contain huge files. These huge files are chunked into a lot of blocks accross a large number of nodes. In order to ensure your data integrity, each block has to be replicated to other nodes.
To conclude, I'd say that if you already dispose of an Hadoop cluster, you may want to take advantage of it.
If that's not the case, and while I can't recommand anything since I have no idea what exactly is your need, I think you can find more light weights frameworks than Hadoop.
Well a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs possibly on many nodes. For this you should use a Scalable Language especially designed for this problem - in other words Scala. Using Scala with Spark is much much easier and much much faster than hadoop.
If you don't have access to a cluster, it can be an idea to use Spark anyway so that you can use it in future more easily. Or just use .par in Scala and that will paralellalize your code and use all the CPUs on your local machine.
Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we do for Data Normalization. This is a need for parallel processing on cheap hardware and software with ease of use instead of going through all the special programming for traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations. Indeed, the test application still being distributed, WordCount, is numbingly simplistic. This is because the genesis of Hadoop was do handle the tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to try to find a more general purpose business use case. Thus, Hadoop as described in its common form is not targeted to the use case you and we have. But, Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a special built hardware environment, PSIKLOPS, that is powerful for small cluster (1-10) nodes with enough power at low cost for run 4-20 parallel jobs. We will be showcasing this in a series of web casts by Inside Analysis titled Tech Lab in conjunction with Cloudera for the first series, coming in early Aug 2014. We see this capability as being a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use to launch multiple concurrent containers of custom Java.

what are the disadvantages of mapreduce?

What are the disadvantages of mapreduce? There are lots of advantages of mapreduce. But I would like to know the disadvantages of mapreduce too.
I would rather ask when mapreduce is not a suitable choice? I don't think you would see any disadvantage if you are using it as intended. Having said that, there are certain cases where mapreduce is not a suitable choice :
Real-time processing.
It's not always very easy to implement each and everything as a MR program.
When your intermediate processes need to talk to each other(jobs run in isolation).
When your processing requires lot of data to be shuffled over the network.
When you need to handle streaming data. MR is best suited to batch process huge amounts of data which you already have with you.
When you can get the desired result with a standalone system. It's obviously less painful to configure and manage a standalone system as compared to a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short on-line transactions.
There might be several other cases. But the important thing here is how well are you using it. For example, you can't expect a MR job to give you the result in a couple of ms. You can't count it as its disadvantage either. It's just that you are using it at the wrong place. And it holds true for any technology, IMHO. Long story short, think well before you act.
If you still want, you can take the above points as the disadvantages of mapreduce :)
HTH
Here are some usecases where MapReduce does not work very well.
When you need a response fast. e.g. say < few seconds (Use stream
processing, CEP etc instead)
Processing graphs
Complex algorithms e.g. some machine learning algorithms like SVM, and also see 13 drawfs
(The Landscape of Parallel Computing Research: A View From Berkeley)
Iterations - when you need to process data again and again. e.g. KMeans - use Spark
When map phase generate too many keys. Thensorting takes for ever.
Joining two large data sets with complex conditions (equal case can
be handled via hashing etc)
Stateful operations - e.g. evaluate a state machine Cascading tasks
one after the other - using Hive, Big might help, but lot of overhead
rereading and parsing data.
You need to rethink/ rewrite trivial operations like Joins, Filter to achieve in map/reduce/Key/value patterns
MapReduce assumes that the job can be parallelized. But it may not be the case for all data processing jobs.
It is closely tied with Java, of course you have Pig and Hive for rescue but you lose flexibility.
First of all, it streams the map output, if it is possible to keep it in memory this will be more efficient. I originally deployed my algorithm using MPI but when I scaled up some nodes started swapping, that's why I made the transition.
The Namenode keeps track of the metadata of all files in your distributed file system. I am reading a hadoop book (Hadoop in action) and it mentioned that Yahoo estimated the metadata to be approximately 600 bytes per file. This implies if you have too many files your Namenode could experience problems.
If you do not want to use the streaming API you have to write your program in the java language. I for example did a translation from C++. This has some side effects, for example Java has a large string overhead compared to C. Since my software is all about strings this is some sort of drawback.
To be honest I really had to think hard to find disadvantages. The problems mapreduce solved for me were way bigger than the problems it introduced. This list is definitely not complete, just a few first remarks. Obviously you have to keep in mind that it is geared towards Big Data, and that's where it will perform at its best. There are plenty of other distribution frameworks out there with their own characteristics.

Mahout recommendation engine: going distributed

Does anybody know how I could transform the code found on the Mahout in Action book, regarding the recommendation engines, so that it is consistent with a Ηadoop fully-distributed environment? My main difficulty is to transform my code (that currently reads and writes from a local disk) so that it runs in a pseudo-distributed environment (such Cloudera). Is the solution to my problem as simple as this one, or I should expect something more complex than that?
A truly distributed computation is quite different than a non-distributed computation, even when computing the same result. The structure is not the same, and the infrastructure it uses is not the same.
If you are just asking how the pseudo-distributed solution works regarding local files: you would ignore the Hadoop input/output mechanism and write a Mapper that reads your input from somewhere on HDFS and copies to local disk.
If you are asking how you actually distribute the computation, then you would have to switch to use the (completely-different) distributed implementations in the project. These actually use Hadoop to split up the computation. The process above is a hack that just runs many non-distributed tasks within a Hadoop container. These implementations are however completely off-line.
If you mean that you want a real-time recommender like in the Mahout .cf.taste packages, but also want to actually use Hadoop's distributed computing power, then you need more than Mahout. It's either one or the other in Mahout; there is code that does one or the other but they are not related.
This is exactly what Myrrix is, by the way. I don't mind advertising it here since it sounds like exactly what you may be looking for. It's an evolution of the work I began in this Mahout code. Among other things, it's a 2-tier architecture that has the real-time elements of Taste but can also transparently offload the computation to a Hadoop cluster.

Parallel processing of several files in a cluster

At the company I work for, everyday we have to process a few thousands of files, which takes some hours. The operations are basically CPU intensive, like converting PDF to high resolution images and later creating many different sizes os such images.
Each one of those tasks takes a lot of CPU, and therefore we can't simply start many instances on the same machine because there won't be any processing power available for everything. Thus, it takes some hours to finish everything.
The most obvious thing to do, as I see it, is to partition the set of files and have them processed by more machines concurrently (5, 10, 15 machines, I don't know yet how many would be necessary).
I don't want to reinvent the wheel and create a manager for task (nor do I want the hassle), but I am not sure which tool should I use.
Although we don't have big data, I have looked at Hadoop for a start (we are running at Amazon), and its capabilities of handling the nodes seem interesting. However, I don't know if it makes sense to use it. I am looking at Hazelcast as well, but I have no experience at all with it or the concepts yet.
What would be a good approach for this task?
Hadoop is being used for a wide variety of data processing problems, some of them are related to image processing also. The problem mentioned in the OP can also be easily solved using Hadoop. Note that in some cases where the data to processed is small, then there is an overhead using Hadoop.
If you are new to Hadoop, would suggest a couple of things
Buy the Hadoop : The Definitive Guide book.
Go through the MapReduce resources.
Start going through the tutorials (1 and 2) and setup Hadoop on a single node and a cluster. There is no need for Amazon, if 1-2 machines can be spared for learning.
Run the sample programs and understand how they work.
Start migrating the problem area to Hadoop.
The advantage of Hadoop over other s/w is the ecosystem around Hadoop. As of now the ecosystem around Hadoop is huge and growing, I am not sure of Hazelcast.
You can use Hazelcast distributed queue.
First you can put your files (file references) as tasks to a distributed queue.
Then each node takes a task from the queue processes it and puts the result into another distributed queue/list or write it to DB/storage.

Resources