Parallel processing of several files in a cluster - hadoop

At the company I work for, everyday we have to process a few thousands of files, which takes some hours. The operations are basically CPU intensive, like converting PDF to high resolution images and later creating many different sizes os such images.
Each one of those tasks takes a lot of CPU, and therefore we can't simply start many instances on the same machine because there won't be any processing power available for everything. Thus, it takes some hours to finish everything.
The most obvious thing to do, as I see it, is to partition the set of files and have them processed by more machines concurrently (5, 10, 15 machines, I don't know yet how many would be necessary).
I don't want to reinvent the wheel and create a manager for task (nor do I want the hassle), but I am not sure which tool should I use.
Although we don't have big data, I have looked at Hadoop for a start (we are running at Amazon), and its capabilities of handling the nodes seem interesting. However, I don't know if it makes sense to use it. I am looking at Hazelcast as well, but I have no experience at all with it or the concepts yet.
What would be a good approach for this task?

Hadoop is being used for a wide variety of data processing problems, some of them are related to image processing also. The problem mentioned in the OP can also be easily solved using Hadoop. Note that in some cases where the data to processed is small, then there is an overhead using Hadoop.
If you are new to Hadoop, would suggest a couple of things
Buy the Hadoop : The Definitive Guide book.
Go through the MapReduce resources.
Start going through the tutorials (1 and 2) and setup Hadoop on a single node and a cluster. There is no need for Amazon, if 1-2 machines can be spared for learning.
Run the sample programs and understand how they work.
Start migrating the problem area to Hadoop.
The advantage of Hadoop over other s/w is the ecosystem around Hadoop. As of now the ecosystem around Hadoop is huge and growing, I am not sure of Hazelcast.

You can use Hazelcast distributed queue.
First you can put your files (file references) as tasks to a distributed queue.
Then each node takes a task from the queue processes it and puts the result into another distributed queue/list or write it to DB/storage.

Related

Spark Jobs on Yarn | Performance Tuning & Optimization

What is the best way to optimize the Spark Jobs deployed on Yarn based cluster ? .
Looking for changes based on configuration not code level. My Question is classically design level question, what approach should be used to optimized the Jobs that are either developed on Spark Streaming or Spark SQL.
There is myth that BigData is magic and your code will be work like a dream once deployed to a BigData cluster.
Every newbie have same belief :) There is also misconception that given configurations over web blogs will be working fine for every problem.
There is no shortcut for optimization or Tuning the Jobs over Hadoop without understating your cluster deeply.
But considering the below approach I'm certain that you'll be able to optimize your job within a couple of hours.
I prefer to apply the pure scientific approach to optimize the Jobs. Following steps can be followed specifically to start optimization of Jobs as baseline.
Understand the Block Size configured at cluster.
Check the maximum memory limit available for container/executor.
Under the VCores available for cluster
Optimize the rate of data specifically in case of Spark streaming real-time jobs. (This is most tricky park in Spark-streaming)
Consider the GC setting while optimization.
There is always room of optimization at code level, that need to be considered as well.
Control the block size optimally based on cluster configuration as per Step 1. based on data rate. Like in Spark it can be calculated batchinterval/blockinterval
Now the most important steps come here. The knowledge I'm sharing is more specific to real-time use cases like Spark streaming, SQL with Kafka.
First of all you need to know to know that at what number or messages/records your jobs work best. After it you can control the rate to that particular number and start configuration based experiments to optimize the jobs. Like I've done below and able to resolve performance issue with high throughput.
I have read some of parameters from Spark Configurations and check the impact on my jobs than i made the above grid and start the experiment with same job but with five difference configuration versions. Within three experiment I'm able to optimize my job. The green highlighted in above picture is magic formula for my jobs optimization.
Although the same parameters might be very helpful for similar use cases but obviously these parameter not covers everything.
Assuming that the application works i.e memory configuration is taken care of and we have at least one successful run of the application. I usually look for underutilisation of executors and try to minimise it. Here are the common questions worth asking to find opportunities for improving utilisation of cluster/executors:
How much of work is done in driver vs executor? Note that when the main spark application thread is in driver, executors are killing time.
Does you application have more tasks per stage than number of cores? If not, these cores will not be doing anything while in this stage.
Are your tasks uniform i.e not skewed. Since spark move computation from stage to stage (except for some stages that can be parallel), it is possible for most of your tasks to complete and yet the stage is still running because one of skewed task is still held up.
Shameless Plug (Author) Sparklens https://github.com/qubole/sparklens can answer these questions for you, automatically.
Some of things are not specific to the application itself. Say if your application has to shuffle lots of data, pick machines with better disks and network. Partition your data to avoid full data scans. Use columnar formats like parquet or ORC to avoid fetching data for columns you don't need all the time. The list is pretty long and some problems are known, but don't have good solutions yet.

Hadoop - CPU intensive application - Small data

Is Hadoop a proper solution for jobs that are CPU intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed to process the so called Big Data, and I wonder how it performs with a small amount of data (but a CPU intensive workload).
I would mainly like to know if a better approach for this scenario exists or instead I should stick to Hadoop.
Hadoop is a distributed computing framework proposing a MapReduce engine. If you can express your parallelizable cpu intensive application with this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop.
A classical example of Hadoop computations is the calculation of Pi, which doesn't need any input data. As you'll see here, yahoo managed to determine the two quadrillonth digit of pi thanks to Hadoop.
However, Hadoop is indeed specialized for Big Data in the sense that it was developped for this aim. For instance, you dispose of a file system designed to contain huge files. These huge files are chunked into a lot of blocks accross a large number of nodes. In order to ensure your data integrity, each block has to be replicated to other nodes.
To conclude, I'd say that if you already dispose of an Hadoop cluster, you may want to take advantage of it.
If that's not the case, and while I can't recommand anything since I have no idea what exactly is your need, I think you can find more light weights frameworks than Hadoop.
Well a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs possibly on many nodes. For this you should use a Scalable Language especially designed for this problem - in other words Scala. Using Scala with Spark is much much easier and much much faster than hadoop.
If you don't have access to a cluster, it can be an idea to use Spark anyway so that you can use it in future more easily. Or just use .par in Scala and that will paralellalize your code and use all the CPUs on your local machine.
Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we do for Data Normalization. This is a need for parallel processing on cheap hardware and software with ease of use instead of going through all the special programming for traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations. Indeed, the test application still being distributed, WordCount, is numbingly simplistic. This is because the genesis of Hadoop was do handle the tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to try to find a more general purpose business use case. Thus, Hadoop as described in its common form is not targeted to the use case you and we have. But, Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
In fact, we have tuned Hadoop to do just this. We have a special built hardware environment, PSIKLOPS, that is powerful for small cluster (1-10) nodes with enough power at low cost for run 4-20 parallel jobs. We will be showcasing this in a series of web casts by Inside Analysis titled Tech Lab in conjunction with Cloudera for the first series, coming in early Aug 2014. We see this capability as being a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use to launch multiple concurrent containers of custom Java.

hadoop map reduce vs clojure pmap function

Supposedly i have large independent sets of data in seperate excel files.
In terms of runtime efficiency, would it be better to use clojure Pmap function to process the data or hadoop map reduce?
Each excel file consists of about 34000 rows at least and i have quite a large number of them.
Sorry for the beginner question as i am relatively new to both and are doing research on them
As some of you guys have explained,
Perhaps one more question would be to compare clojure pmap against instance of running multi instances of the same copies of software, what are the differences between those?
The only thing i can think of is that pmap can take any amount of variables however, reading one file per instance of applications would require the number of files to be known upfront and the instances be initialized
I'd say use Hadoop, but not directly, but rather through Cascalog from Clojure. The value proposition here is all that Hadoop gives you plus the great declarative query language (which may well make using Cascalog worthwhile even if the task is relatively small; setup with Hadoop in local mode is completely hassle-free).
The original introductory blog posts are still the best starting point (although there's great documentation available now -- see the wiki at GitHub): the first one is here and it links to the second one at the end.
To give you a taste of what it looks like, here's a snippet from the first tutorial (finding all "follow" relationships where the follower is older than the person they follow):
(?<- (stdout) [?person1 ?person2]
(age ?person1 ?age1)
(follows ?person1 ?person2)
(age ?person2 ?age2)
(< ?age2 ?age1))
No problem running this on a cluster too, see News Feed in 38 lines of code using Cascalog on Nathan Marz's blog for an example.
I wouldn't go running and establishing an Hadoop cluster just to be able to process a lot of small files (which is not ideal for Hadoop anyway). Hadoop is geared towards handling large files (its block size is 64M) and the map reduce efficiency comes from letting having these large files distributed over the cluster and sending the computation to the data.
In your case it seems that running multiple copies of your software each processing one file at a time would solve the problem and would have the least overhead - both computational and operational (ie. setting up and maintaining hadoop).
One thing that hadoop can give you is the management of the processing task, that is retires in case of failure etc., but again, it seems and overkill for what you describe
Lots of languages have map reduce capabilities, including Clojure.
I'd say that Hadoop would be the hands-down winner because it manages it over clusters of machines. It's the potential for massive parallelization that would give it the clear edge over anything else that didn't have it built in.

Using Hadoop for Parallel Processing rather than Big Data

I manage a small team of developers and at any given time we have several on going (one-off) data projects that could be considered "Embarrassingly parallel" - These generally involve running a single script on a single computer for several days, a classic example would be processing several thousand PDF files to extract some key text and place into a CSV file for later insertion into a database.
We are now doing enough of these type of tasks that I started to investigate developing a simple job queue system using RabbitMQ with a few spare servers (with an eye to use Amazon SQS/S3/EC2 for projects that needed larger scaling)
In searching for examples of others doing this I keep coming across the classic Hadoop New York Times example:
The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth)
Which sounds perfect? So I researched Hadoop and Map/Reduce.
But what I can't work out is how they did it? Or why they did it?
Converting TIFF's in PDF's is not a Map/Reduce problem surely? Wouldn't a simple Job Queue have been better?
The other classic Hadoop example is the "wordcount" from the Yahoo Hadoop Tutorial seems a perfect fit for Map/Reduce, and I can see why it is such a powerful tool for Big Data.
I don't understand how these "Embarrassingly parallel" tasks are put into the Map/Reduce pattern?
TL;DR
This is very much a conceptual question, basically I want to know how would I fit a task of "processing several thousand PDF files to extract some key text and place into a CSV file" into a Map/Reduce pattern?
If you know of any examples that would be perfect, I'm not asking you to write it for me.
(Notes: We have code to process the PDF's, I'm not asking for that - it's just an example, it could be any task. I'm asking about putting that processes like that into the Hadoop Map/Reduce pattern - when there is no clear "Map" or "Reduce" elements to a task.)
Cheers!
Your thinking is right.
The above examples that you mentioned used only part of the solution that hadoop offers. They definitely used parallel computing ability of hadoop plus the distributed file system. It's not necessary that you always will need a reduce step. You may not have any data interdependency between the parallel processes that are run. in which case you will eliminate the reduce step.
I think your problem also will fit into hadoop solution domain.
You have huge data - huge number of PDF files
And a long running job
You can process these files parallely by placing your files on HDFS and running a MapReduce job. Your processing time theoretically improves by the number of nodes that you have on your cluster. If you do not see the need to aggregate the data sets that are produced by the individual threads you do not need to use a reduce step else you need to design a reduce step as well.
The thing here is if you do not need a reduce step, you are just leveraging the parallel computing ability of hadoop plus you are equipped to run your jobs on not so expensive hardware.
I need to add one more thing: error handling and retry. In a distributed environment nodes fail is pretty common. I regularly run EMR cluster consisting of several hundred nodes at time for 3 - 8 days and find out that 3 or 4 fail during that period is very likely.
Hadoop JobTracker will nicely re-submit failed tasks (up to a certain number of times) in a different node.

Hadoop: Disadvantages of using just 2 machines?

I want to do log parsing of huge amounts of data and gather analytic information. However all the data comes from external sources and I have only 2 machines to store - one as backup/replication.
I'm trying to using Hadoop, Lucene... to accomplish that. But, all the training docs mention that Hadoop is useful for distributed processing, multi-node. My setup does not fit into that architecture.
Are they any overheads with using Hadoop with just 2 machines? If Hadoop is not a good choice are there alternatives? We looked at Splunk, we like it, but that is expensive for us to buy. We just want to build our own.
Hadoop should be used for distributed batch processing problems.
5-common-questions-about-hadoop
Analysis of log files is one of the more common uses of Hadoop, its one of the tasks Facebook use it for.
If you have two machines, you by definition have a multi-node cluster. You can use Hadoop on a single machine if you want, but as you add more nodes the time it takes to process the same amount of data is reduced.
You say you have huge amounts of data? These are important numbers to understand. Personally when I think huge in terms of data, i think in the 100s terabytes+ range. If this is the case, you'll probably need more than two machines, especially if you want to use replication over the HDFS.
The analytic information you want to gather? Have you determined that these questions can be answered using the MapReduce approach?
Something you could consider would be to use Hadoop on Amazons EC2 if you have a limited amount of hardware resources. Here are some links to get you started:
hadoop-world-building-data-intensive-apps-with-hadoop-and-ec2
Hadoop Wiki - AmazonEC2

Resources