Understanding Spark alongside Hadoop

In my setup, both Hadoop and Spark are running on the same network but on different nodes. Spark can be run alongside the existing Hadoop cluster simply by launching it as a separate service. Will this show any improvement in performance?
I have thousands of files, around 10 GB, loaded into HDFS.
I have 8 nodes for Hadoop, and 1 master and 5 workers for Spark.

As long as a Spark worker runs on the same node as the data, you get the advantage of data locality. You can launch the Spark service alongside Hadoop as well.
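As a rough sketch of what this looks like in practice (Scala for Spark; the master URL, NameNode address, and path are placeholder assumptions, not from the question): a job that reads straight from HDFS. When the standalone workers are co-located with the datanodes, the scheduler can assign partitions NODE_LOCAL, which is where the performance benefit comes from.

    import org.apache.spark.sql.SparkSession

    // Sketch only: master URL, NameNode host/port, and path are placeholders.
    val spark = SparkSession.builder()
      .appName("hdfs-locality-example")
      .master("spark://spark-master:7077")   // standalone master
      .getOrCreate()

    // HDFS block locations become the scheduler's preferred locations,
    // so workers running on the datanodes can get NODE_LOCAL tasks.
    val lines = spark.read.textFile("hdfs://namenode:8020/data/")
    println(lines.count())

    spark.stop()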

Related

Repartitioning in Hadoop Distributed File System (HDFS)

Is there a way to repartition data directly in HDFS? If you notice that your partitions are unbalanced (one or more is much bigger than the others), how can you deal with it?
I know it could be done in, e.g., Apache Spark, but running a job just to repartition seems like overhead. Or maybe it is a good idea?
Run hdfs balancer. This tool distributes HDFS blocks evenly across datanodes.
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#balancer
If you are running a Cloudera Manager or Ambari managed distribution, you can run the HDFS balancer from their web UI.
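For the Spark route mentioned in the question, here is a minimal sketch (paths and the partition count are assumptions) that rewrites a skewed dataset into a chosen number of roughly equal partitions. Note this works at the file level on a dataset your job rewrites; the hdfs balancer above works at the block level across datanodes, which is usually the right tool for an unbalanced cluster.

    import org.apache.spark.sql.SparkSession

    // Sketch only: input/output paths and partition count are placeholders.
    val spark = SparkSession.builder().appName("repartition-hdfs-data").getOrCreate()

    spark.read.parquet("hdfs://namenode:8020/data/skewed")
      .repartition(200)                      // shuffle into ~equal-sized partitions
      .write.mode("overwrite")
      .parquet("hdfs://namenode:8020/data/balanced")

    spark.stop()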

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?

If I use the Spark standalone cluster manager and have my data distributed in an HDFS cluster, how will Spark know that the data is located locally on the nodes?
YARN is a resource manager. It deals with memory and processes, not with the workings of HDFS or data locality.
Since Spark can read from HDFS sources, and the namenodes and datanodes take care of all the HDFS block data management outside of YARN, I believe the answer is no, you don't need YARN. But you already have HDFS, which means you have Hadoop, so why not take advantage of integrating Spark with YARN?
Standalone mode has its own cluster manager/resource manager, which will talk to the NameNode for locality. The client/driver will then place tasks based on the results.
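To see the locality information the scheduler works from, you can inspect an RDD's preferred locations, which come from the HDFS block locations reported by the NameNode. A small sketch (the HDFS path is an assumption):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: the HDFS path is a placeholder; master is supplied via spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("locality-inspect"))
    val rdd = sc.textFile("hdfs://namenode:8020/data/big-file.txt")

    // preferredLocations returns the datanode hosts holding each partition's
    // blocks; the scheduler uses these to aim for NODE_LOCAL placement,
    // whichever cluster manager is in use.
    rdd.partitions.take(5).foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()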

Replication vs Snapshot in HBase

We have two systems: an offline system (performance is not critical here), where the MapReduce jobs run on the HBase cluster, and an online system (performance is very critical here), where the API reads from the same HBase cluster. Because the MapReduce jobs run on the same cluster, there are performance issues on the online system. So we are trying to set up a separate HBase cluster for the offline system, which replicates a few column families from the source cluster.
So the heavy MapReduce jobs run on the source, and only the online system runs on the replicated cluster, giving the best performance.
My question here is: can't we use the snapshot feature in HBase to do the same thing? I would also like to know the difference between them.
If you use the snapshot feature for MapReduce, it will still consume CPU, memory, and disk I/O on the live HBase cluster nodes. So if disk I/O or CPU is the bottleneck for you, a separate cluster for the MapReduce jobs is the better solution.
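For context, a small sketch of taking a snapshot via the client Admin API from Scala (the table and snapshot names are placeholders). The snapshot itself is a cheap metadata operation, but a MapReduce job that reads it still pulls the underlying HFiles from the same HDFS, which is why the live cluster still pays the disk I/O.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory

    // Sketch only: table and snapshot names are placeholders.
    val conf = HBaseConfiguration.create()
    val conn = ConnectionFactory.createConnection(conf)
    val admin = conn.getAdmin

    // Creates point-in-time references to the table's HFiles; no data copy.
    admin.snapshot("events-offline-snap", TableName.valueOf("events"))

    admin.close()
    conn.close()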

hadoop yarn resource management

I have a Hadoop cluster with 10 nodes; HBase is deployed on 3 of them. There are two applications sharing the cluster.
Application 1 writes and reads data from Hadoop HDFS. Application 2 stores data in HBase. Is there a way in YARN to ensure that the Hadoop M/R jobs launched
by application 1 do not use the slots on the HBase nodes? I want only the HBase M/R jobs launched by application 2 to use the HBase nodes.
This is needed to ensure enough resources are available for application 2 so that the HBase scans are very fast.
Any suggestions on how to achieve this?
If you run HBase and your applications on YARN, the application masters (of HBase itself and of the MR jobs) can request the maximum of the available resources on the data nodes.
Are you aware of the Hortonworks project Hoya (HBase on YARN)?
In particular, one of its features is:
Run MR jobs while maintaining HBase’s low latency SLAs

How many HBase servers should I have per Hadoop server?

I have a system which will feed small image files that are stored in an HBase table, which uses Hadoop for the file system.
I currently have 2 instances of Hadoop and 1 instance of HBase, but my question is: what should the ratio be? Should I have 1 Hadoop node per HBase server, or does it really matter?
The answer is: it depends.
It depends on how much data you have, the CPU utilization of the RegionServers, and various other factors. You need to do some proof-of-concept work to determine the sizing of your Hadoop and HBase clusters; the right split depends on your use cases.
As a matter of fact, I have recently seen a setup where the Hadoop and HBase clusters were totally decoupled: the HBase cluster remotely used Hadoop to read and write on HDFS.
