Running Hadoop software on office computers (when they are idle)

Is there a project which helps set up a Hadoop cluster on office desktops when they are idle?
I'd like to experiment with Hadoop/MR/HBase but don't have access to 5-10 computers. The computers at work are idle after hours and are connected to each other through a very high speed connection. What's more, data on these computers stays within our network, so there is no privacy issue.
In order for this to work I need a fairly lightweight monitor running on each machine. When a computer has been idle for X hours, it joins the cluster. If the user logs on, it has to drop out of the cluster and return all CPU/memory to the user.
Does something like this exist?

You can use the Windows Task Scheduler to detect the idle state and then start/stop a Hadoop VM with VirtualBox or VMware Player. Alternatively, you can write a PowerShell script that starts and stops the VM based on resource usage.
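For example, the two scheduled actions can be as small as the following Python sketch. This is only an illustration: the VM name is a placeholder, VirtualBox's VBoxManage CLI must be on the PATH, and the idle detection itself is left to Task Scheduler's built-in "on idle" and "on logon" triggers.

    import subprocess
    import sys

    VM_NAME = "hadoop-worker"  # hypothetical VirtualBox VM name

    def start_vm():
        # Wired to a Task Scheduler "on idle" trigger: start the worker headless.
        subprocess.run(["VBoxManage", "startvm", VM_NAME, "--type", "headless"],
                       check=True)

    def stop_vm():
        # Wired to an "on logon / idle end" trigger: save the VM state and free
        # the CPU/RAM immediately; the VM resumes where it left off next time.
        subprocess.run(["VBoxManage", "controlvm", VM_NAME, "savestate"],
                       check=True)

    if __name__ == "__main__":
        start_vm() if sys.argv[1:] == ["start"] else stop_vm()

Using savestate rather than a hard poweroff means the Hadoop daemons come back quickly, though the cluster still has to tolerate the node disappearing.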

Hadoop is not a computation grid; it is more of a data grid (see slide 9 in this presentation). The point is that with Hadoop the data is spread over the cluster, so the data has to be stored on the computers themselves. The time it would take to copy the data over and remove it when the machines are no longer idle would probably not be worth it - you'd be better off using Hadoop in the cloud (Amazon, Azure, etc.)

I would use something like Condor: http://research.cs.wisc.edu/condor/

You might want to take a look at Virginia Tech's Project Moon: http://www.wired.com/wiredenterprise/2012/05/project_moon/

Look at solutions like NEREUS, which is a good MPC solution in Java.

Related

When one node is very slow in a Hadoop cluster?

I have a Hadoop cluster of 5 nodes, and two concerns:
1) What can be done when one of the nodes is running or processing data very slowly (not stopped) compared to the other nodes?
2) I've set up log4j to capture logs, but how can I keep the logs of all nodes on the NameNode or on one main server?
Please suggest.
Thanks
To question one: it's not clear which service is slow... DataNode? NameNode? Maybe you need to increase the heap sizes of those processes, or the dataset you've stored is heavily skewed onto that server.
You would need to install monitoring software to capture IO, CPU, network, etc. metrics to really diagnose any hardware bottlenecks. From there, make sure that the one slow server is running the latest OS patches, has the latest drivers, and has a hardware profile similar to the other machines you're comparing against. Maybe the hard drive or NIC is failing, but without hardware diagnostic software it'd be hard to know.
For question 2, you'd again need additional software, such as Elasticsearch, to centrally collect and index your logs from many systems.
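Since you already use log4j, one low-effort option is log4j 1.x's built-in SyslogAppender, which can forward every node's logs to a single collector. A sketch of the per-node configuration; the host name and facility are placeholders for your environment:

    # log4j.properties on each node (collector host is a placeholder)
    log4j.rootLogger=INFO, SYSLOG
    log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
    log4j.appender.SYSLOG.syslogHost=log-collector.example.com
    log4j.appender.SYSLOG.facility=LOCAL1
    log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
    log4j.appender.SYSLOG.layout.ConversionPattern=%d{ISO8601} %p %c - %m%n

The receiving host then only needs a syslog daemon (optionally feeding Elasticsearch) to index the combined stream.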

MarkLogic latency: Document not found

I am working on a clustered MarkLogic environment where we have 10 nodes. All nodes are shared E&D (evaluator and data) nodes.
The problem that we are facing:
When a page is written to MarkLogic, it takes some time (up to 3 seconds) for all the nodes in the cluster to get updated, and if during this window I do a read operation to fetch the previously written page, it is not found.
Has anyone experienced this latency issue and looked at eliminating it? If so, please let me know.
Thanks
It's normal for a new document to only appear after the database transaction commits. But it is not normal for a commit to take 3 seconds.
Which version of MarkLogic Server?
Which OS and version?
Can you describe the hardware configuration?
How large are these documents? All other things equal, update time should be proportional to document size.
Can you reproduce this with a standalone host? That should eliminate cluster-related network latency from the transaction, which might tell you something. Possibly your cluster network has problems, or possibly one or more of the hosts has problems.
If you can reproduce the problem with a standalone host, use system monitoring to see what that host is doing at the time. On Linux I favor something like iostat -Mxz 5 and top, but other tools can also help. The problem could be disk I/O - though it would have to be really slow to result in 3-second commits. Or it might be that your servers are low on RAM, so they are paging during the commit phase.
If you can't reproduce it with a standalone host, then I think you'll have to run similar system monitoring on all the hosts in the cluster. That's harder, but for 10 hosts it is just barely manageable.
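To put numbers on the delay, a small write-then-read probe against the MarkLogic REST API can time the commit and the immediate read. This is only a sketch: it assumes the REST API is enabled on the default port 8000, and the credentials and document URI are placeholders.

    import time

    import requests
    from requests.auth import HTTPDigestAuth

    BASE = "http://localhost:8000/v1/documents"   # assumes default REST port
    AUTH = HTTPDigestAuth("admin", "admin")       # placeholder credentials
    URI = "/latency-test/doc1.xml"                # hypothetical test document

    start = time.time()
    # Write a small document; the PUT returns once the transaction commits.
    resp = requests.put(BASE, params={"uri": URI}, data="<test/>",
                        headers={"Content-Type": "application/xml"}, auth=AUTH)
    resp.raise_for_status()
    wrote = time.time()

    # Read it straight back; a 404 here would reproduce the reported problem.
    r = requests.get(BASE, params={"uri": URI}, auth=AUTH)
    print("commit: %.3fs, read status: %d (%.3fs)"
          % (wrote - start, r.status_code, time.time() - wrote))

Run it against a standalone host first, then against the cluster, and compare the commit times.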

Is there any tool to find at what time of day a Hadoop cluster is usually free from load, and to submit a job at that time daily?

I need to schedule a job in our production cluster. I am trying to schedule it at a time when the cluster is expected to be free, based on how the cluster load looked over the past 30 days. Oozie doesn't have any feature that supports this out of the box, so I am trying to achieve it with some hacks within Oozie.
Is there any standard way to find at what times the cluster was usually free in the past few days, and to automatically submit the job at that time every day?
LinkedIn's White Elephant seems to be the one you are looking for. Ganglia also has pretty good APIs to gauge cluster usage, which you could use.
You can use Cloudera Manager to check overall cluster health (if you are using CDH).
There are also Cloudera Manager APIs to interact with; you can look at those to build your workaround, as sketched below.
http://blog.cloudera.com/blog/2012/09/automating-your-cluster-with-cloudera-manager-api/
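For the workaround, the Cloudera Manager time-series endpoint can pull historical load, which you can then bucket by hour of day to pick a quiet slot. A sketch only: the host, credentials, API version, and the tsquery metric are assumptions you would adapt to your cluster.

    from collections import defaultdict

    import requests

    # Hypothetical Cloudera Manager host and credentials.
    CM = "http://cm-host.example.com:7180/api/v11/timeseries"
    AUTH = ("admin", "admin")

    # Pull ~30 days of cluster CPU usage (tsquery syntax; metric may vary by version).
    resp = requests.get(CM, auth=AUTH, params={
        "query": "select cpu_percent_across_hosts",
        "from": "2015-01-01T00:00:00",
        "to": "2015-01-31T00:00:00",
    })
    resp.raise_for_status()

    # Bucket every data point by its hour of day and average the load.
    by_hour = defaultdict(list)
    for item in resp.json()["items"]:
        for ts in item["timeSeries"]:
            for point in ts["data"]:
                hour = int(point["timestamp"][11:13])  # "YYYY-MM-DDTHH:..."
                by_hour[hour].append(point["value"])

    quietest = min(by_hour, key=lambda h: sum(by_hour[h]) / len(by_hour[h]))
    print("quietest hour of day: %02d:00" % quietest)

A cron job (or an Oozie coordinator materialized at that hour) can then submit the real job.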

Can Hadoop be used as a distributed queue server?

I'm thinking of learning Hadoop but am not sure if it'll solve my problem. Basically, I have a job with a queue and a bunch of workers. Each worker does a small amount of work and then either saves the result (if successful) or sends it back to the queue for further processing. My problem is scalable but limited by network bandwidth (on EC2), which will never keep up with multiple CPUs crunching the data.
I thought maybe I could run my jobs in Java on a Hadoop cluster and have Hadoop distribute the work via a queue. Would this be a better approach? Am I correct in assuming Hadoop can act as a queue and try to run jobs as locally as possible, to minimize bandwidth usage and maximize CPU usage? My program is very CPU-bound, but most of my recent performance problems come from passing work over the network (I want to keep the work as local as possible). The difference between the Hadoop tutorials I've seen and my problem is that in the tutorials all the work is known in advance, while my program is constantly generating new work for itself (until it's finally done). Would this work, and would it help me reduce the impact of passing messages over a network?
Sorry, I'm new to Hadoop and wanted to know if it could solve my problem.
Hadoop is all about running jobs in a batch-like mode over a large data set. It's hard to get it to have some sort of queue-like behavior, but not impossible. There is Apache ZooKeeper, which will give you the synchronization primitives to build a queue if you need one.
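If you do go the ZooKeeper route, the kazoo Python client already ships a queue recipe, so workers can pull tasks and push follow-up work back onto the queue. A sketch under stated assumptions: the ZooKeeper address and queue path are placeholders, and process() is a stand-in for your worker logic.

    from kazoo.client import KazooClient
    from kazoo.recipe.queue import Queue

    def process(task):
        # Placeholder worker logic: pretend every task finishes in one step.
        return task, True

    zk = KazooClient(hosts="127.0.0.1:2181")  # placeholder ZooKeeper ensemble
    zk.start()
    queue = Queue(zk, "/work-queue")          # hypothetical queue znode

    queue.put(b"initial-task")

    while True:
        task = queue.get()  # returns None when the queue is empty (non-blocking)
        if task is None:
            break
        result, done = process(task)
        if not done:
            queue.put(result)  # unfinished work goes back on the queue

    zk.stop()

Note this gives you the queue semantics, but not Hadoop's data locality: ZooKeeper only coordinates, it doesn't place work next to data.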
That said, there are plenty of tools built for the problem it looks like you are trying to solve; I suggest taking a look at RabbitMQ. If you use Python, Celery is quite fantastic.

MapReduce on AWS

Anybody played around with MapReduce on AWS yet? Any thoughts? How's the implementation?
It's easy to get started.
Here's a FAQ: http://aws.amazon.com/elasticmapreduce/faqs/
And here's the Getting Started Guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
If you already have an EC2 account, you can enable Elastic MapReduce and have a sample application up and running in less than 10 minutes using the AWS Management Console.
I did the pre-packaged Word Count sample application, which returns a count of each word contained in about 20 MB of text. You can provision up to 20 instances to run concurrently, though I just used 2 instances and the job completed in about 3 minutes.
The job returns a 300 KB alphabetized list of words and how often each word appears in the sample corpus.
I really like that MapReduce jobs can be written in my choice of Perl, Python, Ruby, PHP, C++, R, or Java. The process was painless and straightforward, and the interface gives good feedback on the status of your instances and the job flow.
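For a sense of how small these jobs can be, a Hadoop Streaming word count in Python needs only a mapper and a reducer along the following lines (a sketch in the spirit of the packaged sample, not its exact code):

    # mapper.py - emit "word<TAB>1" for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # reducer.py - streaming sorts by key between phases, so equal words
    # arrive grouped; sum the counts per word.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

You point the job flow at these two scripts plus your input bucket, and EMR handles the rest.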
Be aware that, since AWS charges for a full hour when an instance is created, and since the MapReduce instances are automatically terminated at the end of the job flow, the cost of multiple fast-running job flows can add up quickly.
For example, if I create a job flow that uses 20 instances and returns results in 15 minutes, and then re-run the job flow 3 more times, I'll be charged for 80 hours of machine time even though I only had 20 instances running for 1 hour.
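A quick back-of-the-envelope check of that billing math, with per-instance rounding up to a whole hour as the assumption:

    import math

    instances, runs, minutes_per_run = 20, 4, 15
    # Each run bills every instance for at least one full hour.
    billed_hours = runs * instances * math.ceil(minutes_per_run / 60)
    wall_clock_hours = runs * minutes_per_run / 60
    print(billed_hours)      # 80 instance-hours billed
    print(wall_clock_hours)  # for only 1 hour of wall-clock time

Batching several job flows into one cluster session avoids paying that rounding penalty repeatedly.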
You also have the possibility to run MapReduce (Hadoop) on AWS with StarCluster. This tool configures the cluster for you and has the advantage that you don't have to pay the extra Amazon Elastic MapReduce price (if you want to reduce your costs), and you can create your own image (AMI) with your tools (this can be good if the installation of the tools can't be done by a bootstrap script).
It is very convenient because you don't have to administer your own cluster: you just pay per use. So I think it is a good idea if you have a job that only needs to run once in a while. We run Amazon MapReduce just once a month so, for our usage, it is worth it.
However, as far as I can tell, a drawback of Amazon Elastic MapReduce is that you can't tell which operating system is running, or even its version. This caused me problems running C++ code compiled with g++ 4.44, some of the OS images did not support the cURL library, etc.
If you don't need any special libraries for your use case, I would say go for it.
Good answer by MB.
To be clear: you can run Hadoop clusters in two ways:
1) Run it on Amazon EC2 instances. This means that you have to install it, configure it, terminate it, etc.
2) Run it using Elastic MapReduce, or EMR: this is an automated way to run a Hadoop cluster on Amazon Web Services. You pay a little extra on top of the basic EC2 cost, but you don't need to manage anything: just upload your data, then your algorithm, then crunch. EMR will shut down the instances automatically once your jobs are finished.
Best,
Simone
EMR is the best way to use available resources with very little added cost over EC2; once you try it, you will see how time-saving and easy it is. Most of the MapReduce implementations in the cloud use this model, e.g. Apache Hadoop on Windows Azure, Mortar Data, etc. I have worked with both Amazon EMR and Apache Hadoop on Windows Azure and found both incredible to use.
Also, depending on the type / duration of jobs you plan to run, you can use AWS spot instances with EMR to get better pricing.
I am working with AWS EMR, and it is pretty neat. Once you start up a cluster and log into the master node, you can play around with the Hadoop directory structure and do pretty cool things. If you have an .edu account, don't forget to apply for a research grant; they give up to $100 in free credits to use on AWS.
AWS EMR is a good choice when you use S3 storage for your data.
It provides out-of-the-box integration with S3 for loading input files and writing processed files back.
In use cases where you need to run the job on demand, you are spared the cost of running the whole cluster all the time, which really helps you save on instance hours.
Leveraging this advantage, you can use AWS Lambda to spawn event-driven clusters.
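A minimal sketch of that Lambda-driven pattern with boto3; the cluster name, release label, instance types, and S3 paths are all placeholders to adapt:

    import boto3

    def handler(event, context):
        # Triggered e.g. by an S3 upload event; spin up a transient EMR cluster.
        emr = boto3.client("emr")
        emr.run_job_flow(
            Name="on-demand-job",                      # hypothetical cluster name
            ReleaseLabel="emr-5.30.0",
            Instances={
                "MasterInstanceType": "m5.xlarge",
                "SlaveInstanceType": "m5.xlarge",
                "InstanceCount": 3,
                "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
            },
            Steps=[{
                "Name": "process-new-data",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    # Placeholder streaming job; swap in your own mapper/reducer.
                    "Jar": "command-runner.jar",
                    "Args": ["hadoop-streaming",
                             "-files", "s3://my-bucket/mapper.py,s3://my-bucket/reducer.py",
                             "-mapper", "mapper.py",
                             "-reducer", "reducer.py",
                             "-input", "s3://my-bucket/input/",
                             "-output", "s3://my-bucket/output/"],
                },
            }],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )

Because KeepJobFlowAliveWhenNoSteps is False, the cluster terminates itself after the step, so you pay only for the instance hours the job actually needs.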
