Install Druid on AWS EMR - hadoop

Just started exploring Druid we dint find any blog on link to install Druid on AWS, Is there any chance to install Druid on AWS EMR ? If so if there are any per-defined Cloud Formation to execute it will be help full for my R&D on Druid.

its pretty straighforward to setup a basic single cluster druid
launch EMR with a single node master, like r3.4xlarge
download imply tar (comes with druid and pivot), https://docs.imply.io/on-prem/quickstart
tar -xzf imply-3.1.8.1.tar.gz cd imply-3.1.8.1
bin/supervise -c conf/supervise/quickstart.conf
If you are looking for a full cluster deploy, EMR is not the right tool.
If you know EKS / kubernetes, I think the easiest way to get started is using Helm
https://github.com/helm/charts/tree/master/incubator/druid
Other option is to look for Imply Cloud
They also solid documentation around Druid. Druid'd own documentation is pretty intense. I found imply to be better for beginners.
https://docs.imply.io/cloud/
Although for POC, a single r3.4xlarge or i3.4xlarge having some 200G storage is good enough

The most likely reason why you would not find much documentation is that the two things have a different nature.
Druid is meant to be long lived and statefull, where the EMR hadoop variant is meant to spin up and down in a more ephemeral manner. As such the combination is somewhat awkward.
Consider using a different hadoop distribution like HDP. Of course you can easily deploy it on AWS if needed, or on your own hardware if you want to minimize infra costs.
Disclaimer: I am an employee of Cloudera, the distributor of HDP, which is currently the most common hadoop platform under Druid.

Related

Kubernetes distributed filesystem

Well, my company is considering to move from Hadoop to Kubernetes. We can find solutions in Kubernetes for tools such as cassandra, sparks, etc. So the last problem for us is how to store massive amount of files in Kubernetes, saying 1 PB. FYI, we DO NOT want to use online storage services such as S3.
As far as I know, HDFS is merely used in Kubernetes and there are a few replacement products such as Torus and Quobyte. So my question is, any recommendation for the filesystem on Kubernetes? Or any better solution?
Many thanks.
You can use a Hadoop Compatible FileSystem such as Ceph or Minio. Both of which offer S3-compatible REST APIs for reading and writing. In Kubernetes, Ceph can be deployed using the Rook project.
But overall, running HDFS in Kubernetes would require stateful services like the NameNode, and DataNodes with proper affinity and network rules in place. The Hadoop Ozone project is a realization that object storage is more common for microservice workloads than HDFS block storage as reasonably trying to analyze PB of data using distributed microservices wasn't feasible. (I'm only speculating)
The alternative is to use Docker support in Hadoop & YARN 3.x

How to instal Hadoop tools on AWS cluster

I am new to Hadoop and big data. I have setup a 4 node working Hadoop cluster in AWS. I wanted to what are the different tools I can install on that and how to install them. My plan is to stream twitter data to HDFS and then looking for specific patterns . What are the tools available for this task.
Thanks in advance.
Raj
You can very easily see what technologies you can have available to your cluster when you request it, and AWS will take care of the installation.
Just go to EMR, create a cluster, then click on advanced options, and you will see something like this:
If you're asking which technology is best suited to your particular use case, then maybe you should post a separate question when you've figured out exactly what you're trying to do.

Hadoop installation on Amazon cloud

i am new to Hadoop ,i likes to go in hadoop administration line so studied basics of hadoop and tried to install hadoop in pseudo distribution mode and installed successfully and run some basic examples also, now i need to improve me further,so i need to try a way to learn hadoop installation and configuration in real time so decided to go for Amazon micro instance ,can any one please tell how to install and configure hadoop in Amazon cloud.
Thanks in Advance.
I have tried this personally and you will not really be able to use hadoop on a single micro instance due to memory restrictions. IMHO you should atleast try a medium instance to run hadoop or better yet use their elastic-mapreduce api which is a modified version of hadoop. You can run a 3 node cluster for around 00.25 cents an hour. If you really want to learn big data this is the way I went.
You should check out their documentation here
http://aws.amazon.com/documentation/elasticmapreduce/

Is there an Amazon community AMI for Hadoop/HBase?

I would like to test out Hadoop & HBase in Amazon EC2, but I am not sure how complicate it is. Is there a stable community AMI that has Hadoop & HBase installed? I am thinking of something like bioconductor AMI
Thank you.
I highly recommend using Amazon's Elastic MapReduce service, especially if you already have an AWS/EC2 account. The reasons are:
EMR comes with a working Hadoop/HBase cluster "out of the box" - you don't need to tune anything to get Hadoop/HBase working. It Just Works(TM).
Amazon EC2's networking is quite different from what you are likely used to. It has, AFAIK, a 1-to-1 NAT where the node sees its own private IP address, but it connects to the outside world on a public IP. When you are manually building a cluster, this causes problems - even using software like Apache Whirr or BigTop specifically for EC2.
An AMI alone is not likely to help you get a Hadoop or HBase cluster up and running - if you want to run a Hadoop/HBase cluster, you will likely have to spend time tweaking the networking settings etc.
To my knowledge there isn't, but you should be able to easily deploy on EC2 using Apache Whirr which is a very good alternative.
Here is a good tutorial to do this with Whirr, as the tutorial says you should be able to do this in minutes !
The key is creating a recipe like this:
whirr.cluster-name=hbase
whirr.instance-templates=1 zk+nn+jt+hbase-master,5 dn+tt+hbase-regionserver
whirr.provider=ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=c1.xlarge
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
You will then be able to launch your cluster with:
bin/whirr launch-cluster --config hbase-ec2.properties

Recommendations for Hadoop on EC2?

When running Hadoop in EC2, I seem to have two options:
A: Manage the cluster myself, using the EC2-specific shell scripts that come with Hadoop.
B: Use Elastic MapReduce, and pay a little extra for the convenience.
I'm leaning towards B, but I'd appreciate some advice from people with more experience. Here are my questions:
Are there any tasks that can be done with one of these methods but not the other?
Are there other options besides these two that I'm overlooking?
If I choose B, how easy would it be to go back to A? That is, what's the danger of vendor lock-in?
Third option:
You can use apache whirr to set up an hadoop cluster on ec2 (rackspace is also supported)
I have been told by people close to the Amazon Elastic MapReduce (EMR) development team that there are at least two other advantages to using EMR: a) Amazon is actively applying bug fixes and performance enhancements to the Hadoop code base used on EMR, and b) Amazon employs a high performance network between EMR servers and S3 servers that may not be available between EC2 servers and S3 servers.
UPDATE: See #mat's comments that refute the rumored advantages of using EMR.
Disclaimer: I'm the founder of Axemblr.com
There are also commercial alternatives you can use. Axemblr Tool for Cloudera CDH3 is a tool we are building that can deploy a cluster in just a few minutes with all you need (including Cloudera Hue, Mahout & Pig).
We are also building an alternative to EMR that's fully compatible from an API perspective, targeted at private clouds.
If you are wondering why it makes sense to run CDH on EC2 rather than EMR see:
http://www.quora.com/What-are-the-advantages-disadvantages-running-Clouderas-distribution-for-Hadoop-on-EC2-instances-rather-than-using-Amazons-Elastic-Map-Reduce-Service

Resources