Druid + Hadoop (for both uses, deep-store & indexing)

If I have a Hadoop server (pseudo-distributed mode) running on a separate machine, do I still need to have these files under my Druid's conf dir? http://druid.io/docs/latest/configuration/hadoop.html
The way I see it:
Those -site.xml files look like they are for the Hadoop server, and Druid only acts as a Hadoop client. So I don't think Druid needs the hdfs-site.xml.
core-site.xml? OK, I can get it. I mean, Druid needs to know the IP of the (Hadoop) name node.
mapred-site.xml? Partially. Druid needs to know the status of MapReduce jobs (I suppose it will delegate the indexing to Hadoop as an MR job), so it needs to communicate with the job tracker to see if the indexing is finished / failed / in progress. For that, it needs the URL of the Hadoop JobTracker.
However, Druid does not need the property "mapreduce.cluster.local.dir", because it does not participate actively in the MR job.
yarn-site.xml? Maybe it should stay, partially. At least for submitting a job(?).
What about hdfs-site.xml? I think this can be scrapped completely.
capacity-scheduler.xml? It can go.
Please correct me if I'm wrong.
These questions/doubts arise because I'm quite new to Hadoop. I have my Hadoop setup running in pseudo-distributed mode. I have also tested it with a JavaScript WebHDFS library to write and read files, and have tried the sample MR jobs provided by the Hadoop distribution. So I guess my Hadoop setup is fine. I'm just a bit unsure on the Druid side, partly because the doc is not very clear about it.
Btw, I have Hadoop 2.7.2, while the hadoop-client libs used by Druid are still on 2.3.0.
Should I downgrade my hadoop server to 2.3.0?
http://druid.io/docs/latest/operations/other-hadoop.html
Thanks,
Raka

Please add mapred-site.xml, core-site.xml, hdfs-site.xml, and yarn-site.xml to the classpath.
Also, you don't need to downgrade; Druid works well with 2.7.x.
As you can see in the doc, you can use multiple versions of Hadoop.
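A minimal sketch of one way to do that, assuming the Hadoop client configs live under /etc/hadoop/conf and Druid under /opt/druid (adjust both paths to your install): copy the four files into Druid's _common configuration directory, which the standard run commands typically put on every Druid process's classpath.
# copy the Hadoop client configs onto Druid's classpath
cp /etc/hadoop/conf/core-site.xml \
   /etc/hadoop/conf/hdfs-site.xml \
   /etc/hadoop/conf/yarn-site.xml \
   /etc/hadoop/conf/mapred-site.xml \
   /opt/druid/conf/druid/_common/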

Related

Why can I use HBase without starting Hadoop/HDFS?

I am new to HBase; recently I installed HBase and tried to start it on my Mac. Everything is fine and I can play with HBase. Some articles say I should start Hadoop first when using HBase, so I am wondering whether this prerequisite has changed.
Hadoop is not a hard requirement for HBase unless you are running fully distributed, which you are not. Running on a single node like you are, you can use the local filesystem. See HBase run modes: Standalone and Distributed for more information.
Your local filesystem (the file:// URI) is Hadoop-compatible. HBase requires a Hadoop-compatible storage layer, but that does not mean it must literally be HDFS.
HDFS simply provides scalability and reliability.
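In standalone mode the relevant setting is hbase.rootdir in hbase-site.xml. A minimal sketch, with /home/user/hbase and /home/user/zookeeper as hypothetical local directories:
<configuration>
  <!-- store HBase data on the local filesystem instead of HDFS -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/user/hbase</value>
  </property>
  <!-- data directory for the embedded ZooKeeper -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/user/zookeeper</value>
  </property>
</configuration>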

Can Apache Spark run without Hadoop?

Are there any dependencies between Spark and Hadoop?
If not, are there any features I'll miss when I run Spark without Hadoop?
Spark is an in-memory distributed computing engine.
Hadoop is a framework for distributed storage (HDFS) and distributed processing (YARN).
Spark can run with or without Hadoop components (HDFS/YARN).
Distributed Storage:
Since Spark does not have its own distributed storage system, it has to depend on one of the following storage systems for distributed computing.
S3 – Non-urgent batch jobs. S3 fits very specific use cases when data locality isn't critical.
Cassandra – Perfect for streaming data analysis and overkill for batch jobs.
HDFS – Great fit for batch jobs without compromising on data locality.
Distributed processing:
You can run Spark in three different modes: standalone, YARN, and Mesos.
Have a look at the SE question below for a detailed explanation of both distributed storage and distributed processing.
Which cluster type should I choose for Spark?
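To make the three modes concrete, here is roughly how the choice of cluster manager shows up at submit time (the host names and my-app.jar are placeholders):
# Spark standalone cluster manager, no Hadoop involved
./bin/spark-submit --master spark://master-host:7077 my-app.jar
# YARN, which needs HADOOP_CONF_DIR (or YARN_CONF_DIR) pointing at the cluster configs
./bin/spark-submit --master yarn --deploy-mode cluster my-app.jar
# Mesos
./bin/spark-submit --master mesos://mesos-host:5050 my-app.jar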
Spark can run without Hadoop, but some of its functionality relies on Hadoop's code (e.g. handling of Parquet files). We're running Spark on Mesos and S3, which was a little tricky to set up but works really well once done (you can read a summary of what was needed to set it up properly here).
(Edit) Note: since version 2.3.0, Spark also added native support for Kubernetes.
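As a rough illustration of the S3 side (not the exact setup from that summary), reading from S3 usually means pulling in the hadoop-aws connector and supplying credentials, something like the sketch below; the package version must match your Hadoop libraries, and the keys are placeholders:
./bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY
# after that, paths like s3a://my-bucket/data/ can be read directly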
By default, Spark does not have a storage mechanism.
To store data, it needs a fast and scalable file system. You can use S3, HDFS, or any other file system. Hadoop is an economical option due to its low cost.
Additionally, if you use Tachyon, it will boost performance with Hadoop. Hadoop is highly recommended for Apache Spark processing.
As per Spark documentation, Spark can run without Hadoop.
You may run it in standalone mode without any resource manager.
But if you want to run it in a multi-node setup, you need a resource manager like YARN or Mesos and a distributed file system like HDFS, S3, etc.
Yes, Spark can run without Hadoop. All core Spark features will continue to work, but you'll miss things like easily distributing all your files (code as well as data) to all the nodes in the cluster via HDFS, etc.
Yes, you can install Spark without Hadoop.
That would be a little tricky.
You can refer to Arnon's link on using Parquet configured on S3 as data storage.
http://arnon.me/2015/08/spark-parquet-s3/
Spark only does processing, and it uses dynamic memory to perform the task, but to store the data you need some data storage system. Here Hadoop comes into play with Spark: it provides the storage for Spark.
One more reason for using Hadoop with Spark is that both are open source and integrate with each other easily compared to other data storage systems. For other storage like S3, configuration is tricky, as mentioned in the link above.
But Hadoop also has its own processing unit, called MapReduce.
Want to know the difference between the two?
Check this article: https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
I think this article will help you understand what to use, when to use it, and how to use it!
Yes, of course. Spark is an independent computation framework. Hadoop is a distributed storage system (HDFS) with the MapReduce computation framework. Spark can get data from HDFS, as well as from any other data source such as a traditional database (JDBC), Kafka, or even the local disk.
Yes, Spark can run with or without a Hadoop installation. For more details you can visit https://spark.apache.org/docs/latest/
Yes, Spark can run without Hadoop. You can install Spark on your local machine without Hadoop, but the Spark distribution comes with pre-built Hadoop libraries, which are used when installing it on your local machine.
You can run Spark without Hadoop, but Spark has a dependency on Hadoop's winutils (on Windows), so some features may not work. Also, if you want to read Hive tables from Spark, then you need Hadoop.
Not good at English, forgive me!
TL;DR
Use local (single node) or standalone (cluster) mode to run Spark without Hadoop, but it still needs Hadoop dependencies for logging and some file processing.
Windows is strongly NOT recommended for running Spark!
Local mode
There are many running modes for Spark; one of them, called local, runs without Hadoop dependencies.
So, here is the first question: how do we tell Spark we want to run in local mode?
After reading the official doc, I just gave it a try on my Linux OS:
You must install Java and Scala; that's not the core content here, so I'll skip it.
Download the Spark package
There are two types of package: "without hadoop" and "hadoop integrated".
The most important thing is that "without hadoop" does NOT mean running without Hadoop, but only that Hadoop is not bundled, so you can bundle it with your custom Hadoop!
Spark can run without Hadoop (HDFS and YARN), but it needs Hadoop dependency jars such as the Parquet/Avro SerDe classes, so I strongly recommend using the "integrated" package (you will also find some logging dependencies like log4j and slf4j and other common utility classes missing if you choose the "without hadoop" package, but all of this is bundled in the Hadoop-integrated package)!
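If you do pick the "without hadoop" build, the Spark docs describe pointing it at an existing Hadoop installation from conf/spark-env.sh; a sketch, assuming the hadoop command is on your PATH:
# in conf/spark-env.sh of a "without hadoop" build
export SPARK_DIST_CLASSPATH=$(hadoop classpath)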
Run in local mode
The simplest way is to just run the shell, and you will see the welcome log.
# same as ./bin/spark-shell --master local[*]
./bin/spark-shell
Standalone mode
Same as above, but with a different step 3.
# Start up the cluster
# if you want to run it in the foreground:
# export SPARK_NO_DAEMONIZE=true
./sbin/start-master.sh
# run this on every worker
./sbin/start-worker.sh spark://VMS110109:7077
# Submit a job or just open a shell
./bin/spark-shell --master spark://VMS110109:7077
On Windows?
I know many people run Spark on Windows just for study, but things are quite different on Windows, and I really strongly do NOT recommend using Windows.
The most important thing is to download winutils.exe from here and configure the system variable HADOOP_HOME to point to where winutils is located.
At this moment, 3.2.1 is the latest release version of Spark, but a bug exists. You will get an exception like Illegal character in path at index 32: spark://xxxxxx:63293/D:\classe when running ./bin/spark-shell.cmd; only starting up a standalone cluster and then using ./bin/spark-shell.cmd, or using a lower version, can temporarily work around it.
For more detail and a solution you can refer here.
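For reference, the winutils setup described above usually amounts to something like this; C:\hadoop is a hypothetical location, with winutils.exe placed in its bin subfolder:
REM in a Command Prompt, or set the equivalent system environment variables
set HADOOP_HOME=C:\hadoop
set PATH=%PATH%;%HADOOP_HOME%\bin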
No. It requires a full-blown Hadoop installation to start working: https://issues.apache.org/jira/browse/SPARK-10944

How to integrate Cassandra with Hadoop

I am trying to set up clustered Hadoop and Cassandra. Many sites I've read use a lot of words and concepts that I am slowly grasping, but I still need some help.
I have 3 nodes. I want to set up Hadoop and Cassandra on all 3. I am familiar with Hadoop and Cassandra individually, but how do they work together, and how do I configure them to work together? Also, how do I set up one node dedicated to, for example, analytics?
So far I have modified my hadoop-env.sh to point to the Cassandra libs. I have put this on all of my nodes. Is that correct? What more do I need to do, and how do I run it: do I start the Hadoop cluster or Cassandra first?
Last little question: do I connect directly to Cassandra or to Hadoop from within my Java client?
Rather than connecting them via your Java client, you need to install Cassandra on top of Hadoop. Please follow the article for step-by-step assistance.
BR
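For reference, the hadoop-env.sh change mentioned in the question usually boils down to putting the Cassandra jars on Hadoop's classpath, something like this sketch (with /opt/cassandra as a placeholder install location):
# in hadoop-env.sh on every node
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cassandra/lib/*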

Can I use Hadoop to run multiple web servers?

I am not sure what Hadoop can and cannot do, and how easy things are.
I understand Hadoop is good at doing MapReduce jobs and at providing HDFS, its distributed filesystem.
What else is Hadoop good at / easy to use for?
My problem: I would like to serve data, the result of MapReduce. And as I have a lot of traffic, I would need 3 front-end servers. Can Hadoop help me deploy a server on 3 of my n running nodes?
Basically, instead of running MapReduce on n machines, I would like to run a custom executable (my server) on 3 machines, and when 1 machine fails, have Hadoop take care of starting the job on another available machine.
Am I supposed to run that on the Hadoop cluster? Or should the Hadoop cluster be used only for MapReduce, with a separate cloud to serve the data from the Hadoop cluster?
Thanks for sharing your experience.
P.S. I am just considering Hadoop right now as a solution; I'm not tied to it.
Your question isn't actually clear, but here is my shot.
You want to display the result of your Hadoop job? Usually a Hadoop job writes its result to HDFS. What you can do is create your own OutputFormat class. You might define an XMLOutputFormat, for example.
But the nice thing is that you can create your own Writable. Take a look at Database Access with Apache Hadoop. In that tutorial you can save the output of a Hadoop job to a database system.
Your frontend can then query the database and show the result.

HBase integration with Hadoop - sync support

I am relatively new to HBase and Hadoop, and this may sound naive. However...
I have issues with integrating HBase with the existing Hadoop cluster.
For the purpose of learning, I configured a 2-node Hadoop 1.1.1 cluster. Let's say master and slave.
I could even run the MapReduce examples without any problems.
On master: 1. NameNode, 2. Secondary NameNode, 3. JobTracker, plus 4. DataNode, 5. TaskTracker
On slave: 1. DataNode, 2. TaskTracker
Now, I want to run HBase 0.90.6 on top of this Hadoop cluster. The problem is that this version of HBase is bundled with the hadoop-core-append jar. To integrate HBase 0.90.6 with Hadoop 1.1.1, I have replaced the hadoop-core jar in the HBase lib directory with the hadoop-core-1.1.1 jar. I also had to place the commons-configuration jar under the HBase lib folder, then make HBase point to the Hadoop cluster via the hbase.rootdir property in hbase-site.xml. This works perfectly fine.
The problem occurs when I start the HBase master web UI and it says:
"You are currently running the HMaster without HDFS append support enabled. This may result in data loss. Please see the HBase wiki for details."
When I searched for sync support, it looked like not all versions of Hadoop support this.
Now the question is: how do I get sync support with the HBase 0.90.6 and Hadoop 1.1.1 combination?
Have you turned on append support in both hbase-site.xml and hdfs-site.xml? This works for HBase 0.96.0.
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
You will have to restart the cluster after making this change.
