How to install Hadoop tools on an AWS cluster - hadoop

I am new to Hadoop and big data. I have set up a working 4-node Hadoop cluster in AWS. I would like to know which tools I can install on it and how to install them. My plan is to stream Twitter data to HDFS and then look for specific patterns. What tools are available for this task?
Thanks in advance.
Raj

You can very easily see what technologies you can have available to your cluster when you request it, and AWS will take care of the installation.
Just go to EMR, create a cluster, then click on advanced options, and you will see the list of applications you can select for the cluster.
If you're asking which technology is best suited to your particular use case, then maybe you should post a separate question when you've figured out exactly what you're trying to do.
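The same choice of applications can also be made from the command line instead of the console. The following is a minimal sketch, assuming the AWS CLI is installed; the cluster name, release label, instance type, and instance count are placeholder values, and emr_create_cmd is a hypothetical helper that only prints the command rather than running it.

```shell
# Build (but do not run) an `aws emr create-cluster` command.
# Each argument is an application name to pre-install on the cluster.
# demo-cluster, emr-6.15.0, m5.xlarge and the count of 4 are placeholders.
emr_create_cmd() {
    apps=""
    for app in "$@"; do
        apps="$apps Name=$app"
    done
    echo "aws emr create-cluster --name demo-cluster --release-label emr-6.15.0 --applications$apps --instance-type m5.xlarge --instance-count 4 --use-default-roles"
}

# Print the command for a cluster with Hadoop, Hive and Spark.
emr_create_cmd Hadoop Hive Spark
```

Running the printed command for real requires configured AWS credentials and will incur charges, so it is worth reviewing the output first.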

Related

Install Druid on AWS EMR

We have just started exploring Druid and did not find any blog or link on installing Druid on AWS. Is it possible to install Druid on AWS EMR? If so, a predefined CloudFormation template for doing it would be helpful for my R&D on Druid.
It's pretty straightforward to set up a basic single-node Druid cluster:
Launch EMR with a single master node, such as r3.4xlarge.
Download the Imply tarball (it comes with Druid and Pivot): https://docs.imply.io/on-prem/quickstart
tar -xzf imply-3.1.8.1.tar.gz
cd imply-3.1.8.1
bin/supervise -c conf/supervise/quickstart.conf
If you are looking for a full cluster deployment, EMR is not the right tool.
If you know EKS / Kubernetes, I think the easiest way to get started is using Helm:
https://github.com/helm/charts/tree/master/incubator/druid
Another option is Imply Cloud.
They also have solid documentation around Druid. Druid's own documentation is pretty intense; I found Imply's to be better for beginners.
https://docs.imply.io/cloud/
For a POC, though, a single r3.4xlarge or i3.4xlarge with some 200 GB of storage is good enough.
The most likely reason you do not find much documentation is that the two things are different in nature.
Druid is meant to be long-lived and stateful, whereas the EMR Hadoop variant is meant to spin up and down in a more ephemeral manner. As such, the combination is somewhat awkward.
Consider using a different Hadoop distribution such as HDP. You can easily deploy it on AWS if needed, or on your own hardware if you want to minimize infrastructure costs.
Disclaimer: I am an employee of Cloudera, the distributor of HDP, which is currently the most common Hadoop platform under Druid.

Hadoop installation on Amazon cloud

I am new to Hadoop and would like to go into Hadoop administration. I have studied the basics, installed Hadoop in pseudo-distributed mode, and run some basic examples. Now I want to go further and learn Hadoop installation and configuration in a real environment, so I decided to use an Amazon micro instance. Can anyone please tell me how to install and configure Hadoop in the Amazon cloud?
Thanks in Advance.
I have tried this personally, and you will not really be able to use Hadoop on a single micro instance due to memory restrictions. IMHO you should at least try a medium instance to run Hadoop, or better yet use their Elastic MapReduce API, which is a modified version of Hadoop. You can run a 3-node cluster for around 25 cents an hour. If you really want to learn big data, this is the way I went.
You should check out their documentation here:
http://aws.amazon.com/documentation/elasticmapreduce/

Hadoop on cluster configuration /Installation

Hi, I have a small question. I started using Hadoop out of curiosity, but now I have the following problem.
My scenario is this: I have 10 machines connected in a LAN, and I need to create the Name Node on one system and Data Nodes on the remaining 9 machines. Do I need to install Hadoop on all 10 machines?
For example, I have machines (1..10), where machine1 is the server and machines (2..10) are slaves [Data Nodes]. Do I need to install Hadoop on all 10 machines?
I have searched a lot for information on Hadoop cluster networking on commodity machines, but I didn't find anything related to installation [that is, configuration]. Some articles explain how to configure and install Hadoop on a single system, but not in a clustered environment.
Can anyone help me and give me a detailed idea, or suggest articles or links, for the above process?
Thanks
Yes, you need Hadoop installed on every node, and each node should have the services appropriate for its role started. The configuration files present on each node also have to coherently describe the topology of the cluster, including the location/name/port of various commonly used resources (e.g. the namenode). Doing this manually, from scratch, is error-prone, especially if you have never done it before and don't know exactly what you're trying to do. It would also be good to decide on a specific distribution of Hadoop (Hortonworks, Cloudera, HDInsight, Intel, etc.).
I would recommend using one of the many deployment solutions out there. My favorite is Puppet, but I'm sure Chef will do too.
A different (perhaps better?) alternative is to use Ambari, which is a deployment and administration solution specialized for Hadoop. See Deploying and Managing Hadoop Clusters with AMBARI.
Some Puppet resources to get you started: Using Vagrant, Puppet, Testing & Hadoop
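To illustrate what "coherently describing the topology" means for the 10-machine scenario in the question, here is a minimal sketch that generates the two pieces every node must agree on: the NameNode address and the list of DataNodes. The hostnames machine1..machine10 and port 9000 are assumptions for illustration, and make_core_site and make_workers are hypothetical helper names. The property fs.defaultFS applies to Hadoop 2 and later; older releases used fs.default.name.

```shell
# Assumed hostname of the NameNode (machine1 in the question's example).
NAMENODE_HOST="machine1"

# Emit a core-site.xml telling a node where the NameNode lives.
# $1 = NameNode hostname; port 9000 is an assumed, commonly used value.
make_core_site() {
    cat <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$1:9000</value>
  </property>
</configuration>
EOF
}

# Emit the DataNode list, one hostname per line.
# $1 = first machine number, $2 = last machine number.
make_workers() {
    i=$1
    while [ "$i" -le "$2" ]; do
        echo "machine$i"
        i=$((i + 1))
    done
}

make_core_site "$NAMENODE_HOST"
make_workers 2 10
```

In practice the first output would be saved as core-site.xml and the second as the workers file (named slaves in Hadoop 2) in the Hadoop configuration directory, with identical copies on all 10 machines. Keeping those copies in sync is exactly the job that Puppet, Chef, or Ambari automates.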
Please see the tutorial below:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
Hope it helps
Yes, Hadoop needs to be installed on all the computers.
For a clustered environment, please go through the video.

Recommendations for Hadoop on EC2?

When running Hadoop in EC2, I seem to have two options:
A: Manage the cluster myself, using the EC2-specific shell scripts that come with Hadoop.
B: Use Elastic MapReduce, and pay a little extra for the convenience.
I'm leaning towards B, but I'd appreciate some advice from people with more experience. Here are my questions:
Are there any tasks that can be done with one of these methods but not the other?
Are there other options besides these two that I'm overlooking?
If I choose B, how easy would it be to go back to A? That is, what's the danger of vendor lock-in?
Third option:
You can use Apache Whirr to set up a Hadoop cluster on EC2 (Rackspace is also supported).
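As a rough illustration of the Whirr approach, a cluster is described in a properties file and then launched with the whirr command. The sketch below only prints such a recipe; the cluster name and node counts are placeholder values, whirr_recipe is a hypothetical helper name, and the role names follow the Hadoop 1 style (jobtracker/tasktracker) that Whirr recipes of that era used.

```shell
# Print a minimal Whirr recipe for a small Hadoop cluster on EC2.
# demo-hadoop and the 1 master + 3 workers layout are placeholders.
# The ${env:...} entries are left literal on purpose: Whirr itself
# resolves them from the environment when it reads the file.
whirr_recipe() {
    cat <<'EOF'
whirr.cluster-name=demo-hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
EOF
}

whirr_recipe
```

Saved as hadoop.properties, the recipe would be used with `whirr launch-cluster --config hadoop.properties`, and the cluster torn down with `whirr destroy-cluster --config hadoop.properties`.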
I have been told by people close to the Amazon Elastic MapReduce (EMR) development team that there are at least two other advantages to using EMR: a) Amazon is actively applying bug fixes and performance enhancements to the Hadoop code base used on EMR, and b) Amazon employs a high performance network between EMR servers and S3 servers that may not be available between EC2 servers and S3 servers.
UPDATE: See @mat's comments that refute the rumored advantages of using EMR.
Disclaimer: I'm the founder of Axemblr.com
There are also commercial alternatives you can use. Axemblr Tool for Cloudera CDH3 is a tool we are building that can deploy a cluster in just a few minutes with all you need (including Cloudera Hue, Mahout & Pig).
We are also building an alternative to EMR that's fully compatible from an API perspective, targeted at private clouds.
If you are wondering why it makes sense to run CDH on EC2 rather than EMR see:
http://www.quora.com/What-are-the-advantages-disadvantages-running-Clouderas-distribution-for-Hadoop-on-EC2-instances-rather-than-using-Amazons-Elastic-Map-Reduce-Service

Any tested Frameworks/Solutions similar to Apache Hadoop?

I am interested in the Apache Hadoop project, but I would like to know whether any other tested (please mind the 'tested') projects/frameworks are out there.
I would appreciate any information/links to projects similar to Apache Hadoop, and any comments on the Apache Hadoop project from anyone who has used it.
Regards,
As mentioned in an answer to this question:
https://stackoverflow.com/questions/2168558/is-there-anything-like-hadoop-in-c
MongoDB might be something you could look at. It's a scalable database that allows MapReduce algorithms to be run against it.
There are indeed open-source projects utilizing and building on Hadoop.
See Apache Mahout for data mining: http://lucene.apache.org/mahout/
And are you aware of the other MR implementations available?
http://en.wikipedia.org/wiki/MapReduce#Implementations
Maybe. But none of them will have anywhere near the testing and real-world experience that Hadoop does. Companies like Facebook and Yahoo are paying to scale Hadoop, and I know of no similar open-source projects that are really worth looking at.
A possible way is to use org.apache.hadoop.hdfs.MiniDFSCluster and org.apache.hadoop.mapred.MiniMRCluster, which are used in testing Hadoop itself.
What they do is launch a small cluster locally. To test your program, make the settings in hdfs-site.xml point to the local cluster and add them to your classpath. This local cluster behaves just like any other cluster, only smaller. You can use hadoop/src/test/*-site.xml as templates.
For more examples, take a look at hadoop/src/test/.
There is a Hadoop-like framework, built on top of Hadoop, that emphasizes prioritized execution of iterative algorithms.
It is tested. I have run the WordCount example on it. It is very similar to Hadoop (especially the installation).
You can find the paper here:
http://rio.ecs.umass.edu/mnilpub/papers/socc11-zhang.pdf
and the code here
https://code.google.com/p/priter/
Hope this helps
A

Resources