How to use Druid with Ambari?

I am very new to Druid, a column-oriented, open-source, distributed data store written in Java.
I need to start multiple services (nodes) in order for Druid to work smoothly. Is there a good way to auto-start these services?

You can find the patch for the Ambari-Druid integration under AMBARI-17981; it will be included as of Ambari v2.5.
The patch file contains all of those changes in the form of a diff.
Typically you need to check out the source code, apply the patch, and then build the project.
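A rough sketch of that workflow (the patch file name and the Maven flags are assumptions; download the actual attachment from the Apache JIRA issue and follow Ambari's own build instructions):

# check out the Ambari source at the branch/version you are targeting
git clone https://github.com/apache/ambari.git
cd ambari

# apply the patch attached to AMBARI-17981
# (the file name is an assumption; use the attachment from the JIRA issue)
git apply AMBARI-17981.patch

# build the project with Maven
# (flags are an assumption; see Ambari's build documentation for the full set)
mvn clean install -DskipTests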

You could use the Hortonworks Data Platform (HDP) distribution, which will install ZooKeeper/HDFS/Druid/PostgreSQL/Hadoop, and you are good to go.
There is also a video guide available on how to install Druid step by step.
Otherwise you can do it yourself by building Druid from source and copying the jars and configs around.
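If you go the build-from-source route, a minimal sketch looks like the following (the repository URL, Maven flags, and output path are assumptions and may differ for older Druid releases; check the build instructions for your version):

# fetch the Druid sources
git clone https://github.com/apache/druid.git
cd druid

# build the distribution with Maven (flags are an assumption)
mvn clean package -DskipTests

# the resulting tarball under distribution/target bundles the jars and
# example configs that you would then copy to each node (path is an assumption)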

Related

What is Apache Maven, and how do I install geomesa-fs on Ubuntu 20.04 through Apache Maven?

I am completely new to spatiotemporal data analysis, and GeoMesa seems to provide all the functionality I need for my project.
Let's say I have a pandas DataFrame or a SQL server with all the location data, like
latitude
longitude
shopid
and
latitude
longitude
customerid
timestamp
To my knowledge (and assuming the other required data is available), GeoMesa will help me analyse which shops are nearest to a customer along their route, decide whether to show that customer an ad for a shop, find popular shops, and so on.
The GeoMesa installation documentation requires installing Apache Maven, which I did with
sudo apt install maven
(screenshot of the Maven version output)
Now there are a lot of options for running GeoMesa.
Is GeoMesa only for distributed systems?
Is it even possible to use GeoMesa for my problem?
Is it a dependency?
Can I use it through Python?
Also, can you suggest the best choice of database for spatiotemporal data?
I downloaded geomesa-fs since my data doesn't have any distributed properties, but I don't know how to use it.
GeoMesa is mainly used with distributed systems, but not always. Take a look at the introduction in the documentation for more details. For choosing a database, take a look at the Getting Started page. Python is mainly supported through PySpark. Maven is only required for building from the source code, which you generally would not need to do.
If you already have your data in MySQL, you may just want to use GeoTools and GeoServer, which support MySQL.
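If you do want to experiment with the geomesa-fs distribution you already downloaded, a rough getting-started sketch looks like this (the archive name and the exact command-line layout are assumptions; check the FileSystem DataStore section of the GeoMesa documentation for your version):

# unpack the binary distribution (archive/directory names are assumptions;
# they depend on the Scala and GeoMesa versions you downloaded)
tar xzf geomesa-fs_2.12-3.2.0-bin.tar.gz
cd geomesa-fs_2.12-3.2.0

# the command-line tools live under bin/; list the available commands
bin/geomesa-fs help

# ingesting a CSV of shop locations would then use the ingest command,
# whose options are described by: bin/geomesa-fs help ingest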

How to install Hadoop tools on an AWS cluster

I am new to Hadoop and big data. I have set up a working 4-node Hadoop cluster in AWS. I wanted to know what different tools I can install on it and how to install them. My plan is to stream Twitter data to HDFS and then look for specific patterns. What tools are available for this task?
Thanks in advance.
Raj
You can very easily see which technologies are available for your cluster when you request it, and AWS will take care of the installation.
Just go to EMR, create a cluster, then click on advanced options, and you will see the list of applications (Hadoop, Hive, Pig, Spark, HBase, Hue, and so on) that EMR can install for you.
If you're asking which technology is best suited to your particular use case, then maybe you should post a separate question when you've figured out exactly what you're trying to do.
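If you prefer the command line, the same selection can be made with the AWS CLI. A minimal sketch (the release label, instance type/count, key pair, and application list are just example values):

# create a small EMR cluster with a chosen set of applications
aws emr create-cluster \
  --name "twitter-analysis" \
  --release-label emr-5.30.0 \
  --applications Name=Hadoop Name=Hive Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 4 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles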

Setting up Hadoop in a public cloud

As a part of my college project, I would like to modify Hadoop's source code. However, the problem is that I would need at least 20 systems to test it. Is it possible to set up this modified version of Hadoop on public clouds such as Google Cloud Platform or Amazon Web Services? Can you give me an idea of the procedure to follow? I could only find information about setting up the original Hadoop versions in a public cloud; I couldn't find anything relevant to my case. Please do help me out.
Amazon offers Elastic MapReduce (EMR), but as you correctly pointed out, you will not be able to deploy your own version of Hadoop there.
You can still use Amazon or Google Cloud to get plain Linux servers and install your Hadoop on them. It is just a longer process, but not different from any other Hadoop installation if you have done one before.
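A rough sketch of what that looks like on each VM, assuming you have packaged your modified Hadoop as a tarball (paths, the version string, and the JDK location are placeholders):

# copy your custom-built Hadoop tarball to the VM, then:
tar xzf hadoop-2.x.x-custom.tar.gz -C /opt
export HADOOP_HOME=/opt/hadoop-2.x.x-custom
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # adjust to the installed JDK
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"

# edit $HADOOP_HOME/etc/hadoop/{core-site.xml,hdfs-site.xml,yarn-site.xml}
# so that all 20 nodes point at the same NameNode/ResourceManager, then start
# the daemons exactly as you would for a stock Hadoop release.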

Cascading HBase Tap

I am trying to write Scalding jobs which have to connect to HBase, but I have trouble using the HBase tap. I have tried using the tap provided by Twitter Maple, following this example project, but it seems that there is some incompatibility between the Hadoop/HBase version that I am using and the one that was used as client by Twitter.
My cluster is running Cloudera CDH4 with HBase 0.92 and Hadoop 2.0.0-cdh4.1.3. Whenever I launch a Scalding job connecting to HBase, I get the exception
java.lang.NoSuchMethodError: org.apache.hadoop.net.NetUtils.getInputStream(Ljava/net/Socket;)Ljava/io/InputStream;
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:363)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1046)
...
It seems that the HBase client used by Twitter Maple is expecting some method on NetUtils that does not exist on the version of Hadoop deployed on my cluster.
How do I track down what exactly is the mismatch - what version would the HBase client expect and so on? Is there in general a way to mitigate these issues?
It seems to me that client libraries are often compiled against a hardcoded version of the Hadoop dependencies, and it is hard to make those match the versions actually deployed.
The method actually exists but has changed its signature. Basically, it boils down to having different versions of Hadoop libraries on your client and server. If your server is running Cloudera, you should be using the HBase and Hadoop libraries from Cloudera. If you're using Maven, you can use Cloudera's Maven repository.
It seems like library dependencies are handled in Build.scala. I haven't used Scala yet, so I'm not entirely sure how to fix it there.
The change that broke compatibility was committed as part of HADOOP-8350. Take a look at Ted Yu's comments and the responses. He works on HBase and had the same issue. Later versions of the HBase libraries should automatically handle this issue, according to his comment.
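As for tracking down the mismatch itself, a rough approach (assuming shell access to both the cluster and the machine where the job is built) is to compare the Hadoop version the servers run with the Hadoop jars your build pulls in:

# on the cluster: print the Hadoop version the servers are running
hadoop version

# on the client: list the Hadoop jars pulled in by your build
# (cache locations are assumptions; adjust for sbt/Ivy or Maven as appropriate)
find ~/.ivy2/cache ~/.m2/repository -name 'hadoop-core-*.jar' 2>/dev/null

# the version embedded in a given jar's manifest
unzip -p /path/to/hadoop-core-<version>.jar META-INF/MANIFEST.MF | grep -i version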

Hadoop cluster set up with 0.23 release (MRv2 or NextGen MR)

As I see it, the latest stable release of Hadoop is 0.20.x, and the latest release overall is 0.23.x. There seem to be a lot of changes from 0.20.x to 0.23.x.
We were able to set up a small cluster with the stable release (0.20.2) and are practising MapReduce programming on it.
We have seen a lot of new APIs added in 0.23.x. In order to explore 0.23.x, we need to set up a cluster with the 0.23.x release as well.
Could you point us to documentation on setting up a cluster with a 0.23.x release?
0.23.x seems completely different; it's not like 0.20.x when I untar the tar file. Please give us some book/doc reference where the cluster setup is described from the beginning.
Thanks
MRK
The major difference between 0.23 and pre-0.23 releases is that in 0.23 the resource management and the application life-cycle management have been separated. Pre-0.23 allowed only MapReduce applications to run, whereas 0.23 allows other applications besides MapReduce. Hama, Giraph and some other applications have already been ported, and the porting of MPI is in progress.
We have seen a lot of new APIs added in 0.23.x. In order to explore 0.23.x, we need to set up a cluster with the 0.23.x release as well.
There haven't been any changes in the user API, so existing applications should run without any code changes, but configuration file changes are required. The 0.23 release is backwards compatible from an API perspective.
Here is the consolidated list of MRv2 architecture, videos, articles etc. I will try to keep them updated as I come across new information.
http://www.thecloudavenue.com/p/mrv2resources.html
This is the official documentation for cluster setup in r0.23.0:
http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
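Regarding the tarball looking completely different: the directory layout did change, and the daemons are started with new scripts. A minimal single-node sketch (script names as found under bin/ and sbin/ in the 0.23 tarball; adapt the configuration files first as described in the ClusterSetup guide):

# format HDFS once, then start the HDFS daemons
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode

# start the YARN (MRv2) daemons that replace the old JobTracker/TaskTracker
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager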

Resources