Can hadoop be used as a distributed queue server? - hadoop

I'm thinking of learning Hadoop but I'm not sure it will solve my problem. Basically, I have a job with a queue and a bunch of workers. Each worker does a small amount of work and then either saves the result (if successful) or sends it back to the queue for further processing. My problem scales well, but it is limited by network bandwidth (on EC2), which will never keep up with multiple CPUs crunching the data. I thought maybe I could run my jobs in Java on a Hadoop cluster and have Hadoop distribute the work via a queue. Would this be a better approach? Am I correct in assuming Hadoop can run a queue and try to run jobs as locally as possible, to minimize bandwidth usage and maximize CPU usage?
My program is very CPU-bound, but most of my recent performance problems come from passing work over the network (I want to keep the work as local as possible). The difference between the Hadoop tutorials I see and my problem is that in the tutorials all the work is known in advance, while my program constantly generates new work for itself (until it is finally done). Would this work, and would it help me reduce the impact of passing messages over a network?
Sorry I'm new to hadoop and wanted to know if it could solve my problem.

Hadoop is all about running jobs in a batch-like mode over a large data set. It's hard to get it to have some sort of queue-like behavior, but not impossible. There is Apache ZooKeeper, which will give you synchronization to build a queue if you need it.
There are plenty of tools that solve the problem it looks like you are trying to solve. I suggest taking a look at RabbitMQ. If you use Python, Celery is quite fantastic.
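The pattern from the question (workers that either save a result or send follow-up work back to the queue) can be sketched in a single process with Python's standard library. This is only an illustration of the queue behavior, not the RabbitMQ or Celery API; `run_workers`, `process` and `countdown` are hypothetical names, and a real deployment would replace the in-memory queue with a broker.

```python
import queue
import threading


def run_workers(initial_items, process, num_workers=4):
    """Drain a work queue whose workers may put new work back on it.

    `process(item)` is a hypothetical callback that returns either
    ("done", result) or ("retry", new_item) for further processing.
    """
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    for item in initial_items:
        work.put(item)

    def worker():
        while True:
            item = work.get()
            if item is None:            # sentinel: shut this worker down
                work.task_done()
                return
            status, payload = process(item)
            if status == "done":
                with results_lock:
                    results.append(payload)
            else:
                # Re-enqueue before task_done() so join() cannot return early.
                work.put(payload)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    work.join()                         # waits for re-queued items as well
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return results


def countdown(item):
    """Toy task: an item needs `remaining` more passes before it is done."""
    name, remaining = item
    if remaining == 0:
        return ("done", name)
    return ("retry", (name, remaining - 1))
```

Calling `run_workers([("a", 2), ("b", 0)], countdown)` returns the names of both items once every re-queued pass has finished. Threads keep the sketch short; for CPU-bound work like the question describes, separate processes or machines fed by a broker would be the more realistic setup.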

Related

Worker node execution in Apache Storm

A Storm topology has been deployed using the storm command on machine X. The worker nodes are running on machine Y.
Once the topology has been deployed, it is ready to process tuples, and the workers are handling requests and responses.
Can anyone explain how a worker node identifies its work and data? I am not sure how a worker node gets access to code that the developer never deployed to it.
If the topology code is accessible to the worker nodes, where is it located, and how do the worker nodes execute it?
First, you're asking a fairly complex question. I've been using Storm for a while and don't understand much about how it works internally. Here is a good article talking about the internals of Storm. It's over two years old but should still be highly relevant. I believe that Netty is now used as the internal messaging transport; it's mentioned as being experimental in the article.
As for the code being run on worker nodes, there is a configuration setting in storm.yaml,
storm.local.dir
When you upload the topology, I believe Storm copies the jar to that location, so every worker machine will have the necessary jar in its configured storm.local.dir. Even though you only upload it to one machine, Storm will distribute it to the necessary workers. (That's from memory, and I'm not in a spot to test it at the moment.)

How to avoid filling up Hadoop logs on nodes?

When our Cascading jobs encounter an error in the data, they throw various exceptions. These end up in the logs, and if the logs fill up, the cluster stops working. Is there a config file we can edit to avoid such scenarios?
We are using MapR 3.1.0, and we are looking for a way to limit log usage (syslogs/userlogs) without using centralized logging and without adjusting the logging level. We don't much mind whether it keeps the first N bytes or the last N bytes of the logs and discards the rest.
We don't really care about the logs; we only need the first (or last) few megabytes to figure out what went wrong. We don't want to use centralized logging because we don't want to keep the logs and don't want to pay the performance overhead of replicating them. Also, correct me if I'm wrong: user_log.retain-size has issues when JVM reuse is enabled.
Any clue/answer will be greatly appreciated!
Thanks,
Srinivas
This should probably be on a different Stack Exchange site, as it's more of a DevOps question than a programming question.
Anyway, what you need is for your DevOps team to set up logrotate and configure it according to your needs, or to edit the log4j XML files on the cluster to change the way logging is done.
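To make the "keep only the last N bytes" idea from the question concrete, here is a minimal Python sketch that trims a file in place. It is not logrotate and not a Hadoop or MapR setting; `keep_last_bytes` is a hypothetical helper, and it assumes nothing is writing to the file while it is being trimmed.

```python
import os


def keep_last_bytes(path, max_bytes):
    """Trim the file at `path` in place so only its last `max_bytes` bytes remain.

    Returns True if the file was trimmed, False if it was already small enough.
    Assumption: no other process is writing to the file during the call.
    """
    size = os.path.getsize(path)
    if size <= max_bytes:
        return False
    with open(path, "r+b") as f:
        f.seek(size - max_bytes)   # jump to the start of the tail to keep
        tail = f.read()
        f.seek(0)
        f.write(tail)              # move the tail to the front of the file
        f.truncate()               # drop everything after the tail
    return True
```

A scheduled job could call something like this on each task log directory, but a tool built for the purpose, such as logrotate, handles concurrent writers and compression far more robustly.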

Spring Batch Parallel Job Scaling

I'm currently working on a Spring Batch POC and have got a pretty good handle on most of the actual Spring Batch features. I've currently got a program that uses Spring Integration to receive an HttpRequest and uses message channels to eventually send the job executions to the job launcher in a queue. What we'd really like to do is implement some kind of "scheduler/load balancer" (not quite sure what to call it) in front of the job launcher that looks at the currently running worker nodes and the size of the input file and decides how many worker nodes the job should be allowed. We would probably also want to be able to change the number of worker nodes a job has while it is running, to allow more jobs to run.
The idea is that we'd have a server running that could accept many job requests at any time, and a large cluster of machines that jobs will be partitioned onto. We'd like to be able to scale horizontally, so whenever the server isn't busy it can make full use of the hardware, as well as being able to make sure that small jobs don't get constantly blocked by larger jobs.
From my research it seems like we'd have to implement another framework to do this (do GridGain and Hadoop allow this?), but I figured I'd ask to see what people recommended to do something like this, and if there's a way to do it without implementing another large framework.
Sorry if anything is unclear or confusing, I'm just a lowly intern who started learning Spring and Spring Batch last month and I'm far from completely understanding everything, especially this scaling stuff. Just ask and I'll try to clear things up.
Thanks for any help!
Take a look at the 'spring-batch-integration' project under the spring-batch-admin umbrella project https://github.com/SpringSource/spring-batch-admin
It has a number of examples of using spring-integration to distribute work to other nodes. In particular, see the chunk and partition packages. Just swap out the Spring Integration channels for JMS channel adapters. By distributing work partitions via JMS, you can scale out the number of worker nodes as needed.
There are a number of threads on this subject in the spring integration forum; search for 'PartitionHandler'.
Hope that helps.

Sharing a cluster with Hadoop

Is it possible to set Hadoop up so that it plays nicely with other applications on a cluster?
I'm familiar with the Torque+Maui resource scheduler, and with using HadoopOnDemand to provision temporary Hadoop clusters. But that gets pretty cumbersome if lots of people want to use Hadoop: each person has the same headache of setting up and tearing down their own mini Hadoop cluster, copying data on and off their own HDFS, etc.
It would be much cooler if we could have one permanent instance of Hadoop running that people share, with an HDFS that is always up. This would require Hadoop to intelligently allocate work to nodes that aren't busy with other applications (like R, say), and not to be too greedy when queueing jobs.
Is this possible?
Isn't this what the fair scheduler does?
http://hadoop.apache.org/mapreduce/docs/r0.21.0/fair_scheduler.html
We use this to run a permanent hadoop cluster with 30 users. You can have it preempt tasks to reallocate to new pools, and can set individual priorities for each pool too.

MapReduce on AWS

Anybody played around with MapReduce on AWS yet? Any thoughts? How's the implementation?
It's easy to get started.
Here's a FAQ: http://aws.amazon.com/elasticmapreduce/faqs/
And here's the Getting Started Guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
If you have an EC2 account already, you can enable MapReduce and have a sample application up and running in less than 10 minutes using the AWS Management Console.
I did the pre-packaged Word Count sample application, which returns a count of each word contained in about 20 MB of text. You can provision up to 20 instances to run concurrently, though I just used 2 instances and the job completed in about 3 minutes.
The job returns a 300 KB alphabetized list of words and how often each word appears in the sample corpus.
I really like that MapReduce jobs can be written in my choice of Perl, Python, Ruby, PHP, C++, R, or Java. The process was painless and straightforward, and the interface gives good feedback on the status of your instances and the job flow.
Be aware that, since AWS charges for a full hour when an instance is created, and since the MapReduce instances are automatically terminated at the end of the job flow, the cost of multiple fast-running job flows can add up quickly.
For example, if I create a job flow that uses 20 instances and returns results in 15 minutes, and then re-run the job flow 3 more times, I'll be charged for 80 hours of machine time even though I only had 20 instances running for 1 hour.
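The arithmetic behind that example can be checked with a short Python sketch. It assumes the billing rule the answer describes (each instance in each job flow is rounded up to a whole hour); the function names are made up for illustration.

```python
import math


def billed_instance_hours(instances, minutes_per_run, runs):
    """Instance-hours billed when every run is rounded up to whole hours per instance."""
    hours_per_run = math.ceil(minutes_per_run / 60)
    return instances * hours_per_run * runs


def used_instance_hours(instances, minutes_per_run, runs):
    """Instance-hours of compute actually consumed."""
    return instances * minutes_per_run * runs / 60


# 20 instances, 15 minutes per job flow, run 4 times in total
billed = billed_instance_hours(20, 15, 4)   # 20 * 1 * 4 = 80
used = used_instance_hours(20, 15, 4)       # 20 * 15 * 4 / 60 = 20.0
```

Under that rounding rule, four 15-minute job flows are billed as 80 instance-hours for 20 instance-hours of actual use, so batching the work into a single longer job flow would cost roughly a quarter as much.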
You can also run MapReduce (Hadoop) on AWS with StarCluster. This tool configures the cluster for you, and it has the advantage that you don't have to pay the extra Amazon Elastic MapReduce price (if you want to reduce your costs). You can also create your own image (AMI) with your tools, which is useful if the tools can't be installed by a bootstrap script.
It is very convenient because you don't have to administer your own cluster. You just pay per use so I think it is a good idea if you have a job that needs to run once in a while. We are running Amazon MapReduce just once a month so, for our usage, it is worth it.
However, as far as I can tell, a drawback of Amazon MapReduce is that you can't tell which operating system is running, or even its version. This caused me problems running C++ code compiled with g++ 4.44, some of the OS images do not support the cURL library, and so on.
If you don't need any special libraries for your use case, I would say go for it.
Good answer by MB.
To be clear: you can run Hadoop clusters in two ways:
1) Run it on Amazon EC2 instances. This means that you have to install it, configure it, terminate it, etc.
2) Run it using Elastic MapReduce, or EMR: this is an automated way to run a Hadoop cluster on Amazon Web Services. You pay a little extra on top of the basic cost for EC2, but you don't need to manage anything: just upload your data, then your algorithm, then crunch. EMR will shut down the instances automatically once your jobs are finished.
Best,
Simone
EMR is the best way to use the available resources at very little added cost over EC2, and you will see how time-saving and easy it is. Most MapReduce offerings in the cloud use this model, e.g. Apache Hadoop on Windows Azure, Mortar Data, etc. I have worked with both Amazon EMR and Apache Hadoop on Windows Azure and found both incredible to use.
Also, depending on the type / duration of jobs you plan to run, you can use AWS spot instances with EMR to get better pricing.
I am working with AWS EMR. It is pretty neat: once you start up a cluster and log in to its master node, you can play around with the Hadoop directory structure and do pretty cool things. If you have an edu account, don't forget to apply for a research grant. They give up to $100 in free credits to use on AWS.
AWS EMR is a good choice when you use S3 storage for your data.
It provides out of the box integration with S3 for loading files and posting processed files.
In use cases where you need to run the job on demand, you are saved the cost of running the whole cluster all the time, which really helps you save on instance hours.
Building on that advantage, you can use AWS Lambda to spawn event-driven clusters.
