How to install Kafka in a Hadoop cluster

I want to install the latest release of Kafka on my Ubuntu Hadoop cluster, which contains 1 master node and 4 data nodes.
Here are my questions:
Should Kafka be installed on all the machines or only on the NameNode machine?
What about ZooKeeper? Should it be installed on all the machines or only on the NameNode machine?
Please share the documentation required to install Kafka and ZooKeeper in a 5-node Hadoop cluster.

The architecture depends strictly on your requirements and on what you have: how powerful your machines are, how much data they need to process, how many consumers the Kafka instances need to feed, and so on. In theory you can run 1 Kafka instance and 1 ZooKeeper instance, but that setup is not fault-tolerant: if either one fails, you can lose data.
You can find more information about multi-node ZooKeeper clusters here.
What I would do first is analyze:
how much data they need to process,
how much data they need to "ingest",
how powerful your machines are,
how many consumers you are going to need,
how reliable your machines are.
These are just a few factors to consider before starting to build an infrastructure. For a rough estimate based on "just" 5 machines, assuming they are all equally powerful and have a good amount of memory (e.g., 32GB per machine), you need at least a couple of Kafka nodes and at least 3 machines for ZooKeeper (2N + 1), so that if one fails, ZooKeeper can handle the failure.
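The 2N + 1 rule above can be checked with a little arithmetic. The following is a minimal sketch, not ZooKeeper code: the function names are made up for illustration, and it only models the majority rule (an ensemble stays available while more than half of its servers are reachable).

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare ensemble sizes; note that 4 servers tolerate no more
    # failures than 3, which is why odd sizes are the usual choice.
    for n in range(1, 8):
        print(f"{n} servers: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this rule, 3 ZooKeeper machines tolerate one failure and 5 tolerate two, while a single instance tolerates none.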

Related

Installing NiFi (open source) on the datanodes of an existing Hadoop cluster

If you have 10 datanodes on an existing Hadoop cluster could you install NiFi on 4 or 6 datanodes?
The main purpose of NiFi would be loading data daily from RDBMS to HDFS, high volume.
Datanodes would be configured with high RAM, let's say 100GB.
External 3 node Zookeeper cluster would be used.
Are there any major concerns with this approach?
Does it make more sense to just install NiFi on EVERY datanode, so 10?
Are there any issues with having a large cluster of 10 nifi nodes?
Will some NiFi configuration best practices conflict with Hadoop config?
Edit: Currently using Hortonworks version 2.6.5 and open source NiFi 1.9.2
Are there any major concerns with this approach?
Cloudera Data Platform is integrated with Cloudera DataFlow, which is based on Apache NiFi, so integration should not be a concern.
Does it make more sense to just install NiFi on EVERY datanode, so 10?
It depends on what traffic you are expecting, but I would consider NiFi a standalone service, like Kafka or ZooKeeper, so a cluster of 3 would be a great start, and you can add nodes if needed. Starting with all DataNodes is not required. It is OK to share these services with DataNodes; just make sure resources are allocated correctly (cores, memory, storage...). This is easier with Cloudera.
Are there any issues with having a large cluster of 10 nifi nodes?
For more info on scaling, see "6) NiFi Clusters Scale Linearly". You would need a lot of traffic to justify going over 10 nodes.
Will some NiFi configuration best practices conflict with Hadoop config?
That depends on how you configure it. I would advise using Cloudera for both, since the two are well tested to work together. You may not end up with the latest versions of your services, but at least you get higher reliability.
Even if you have an existing HDP 2.6.5 cluster, or perhaps by now you have upgraded to HDP 3 or even its successor CDP, you can use the Hortonworks/Cloudera NiFi solution via your management console. So if you currently use Ambari (or its counterpart Cloudera Manager), the recommended way to install NiFi is through that.
It will be called Hortonworks Data Flow or Cloudera Data Flow respectively.
Regarding the other part of your question:
Typically it is recommended to install NiFi on dedicated nodes, and 10 nodes is likely overkill if you are not sure you need them.
Here is some information on sizing your NiFi deployment. Note that Cloudera and Hortonworks have merged, so although the site is called Cloudera, this page was actually written with an HDP cluster in mind. That does not affect the sizing.
https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.1.1/bk_planning-your-deployment/content/ch_hardware-sizing.html
Full disclosure: I am an employee of Cloudera (formerly Hortonworks)

Is there a significant performance difference between Pseudo-Distributed and Fully-distributed mode in Hadoop?

I was reading the document of Hadoop, and I found this:
"Both standalone mode and pseudo-distributed mode are provided for the purposes of small-scale testing".
I have 2 questions.
First, how big is considered small-scale? More specifically, I'm going to use at most 32 nodes; is it OK for me to run in pseudo-distributed mode?
Second, even at small scale, is there any performance difference between pseudo-distributed and fully-distributed mode? I'm running Hadoop on my Mac, and it's kind of difficult for me to find a real cluster system. Is there anything I have to pay attention to?
at most 32 nodes, is this ok for me to run it in the pseudo-distributed mode?
Pseudo-distributed specifically means you only have one node. All Hadoop services talk to each other as if they were connected over an external network interface rather than all on localhost, and they use HDFS, not just the local filesystem.
In order to create a "distributed mode" cluster, you can add additional nodes to your single node by using the correct configurations. Tip: Apache Ambari would make this process much easier.
However, HDFS replicates each block three times by default, and to accommodate downtime in these services, 5 nodes is a good minimum. I also recommend that you set up High Availability in your cluster using a standalone installation of 3-5 ZooKeeper servers.
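To see why 5 nodes leaves useful headroom over the default replication factor of 3, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the model is deliberately simplified: it only uses the fact that each replica of a block lives on a different DataNode, and it ignores disk capacity and rack placement.

```python
def can_hold_all_replicas(live_datanodes: int, replication_factor: int = 3) -> bool:
    """Each replica needs its own DataNode, so we need at least that many alive."""
    return live_datanodes >= replication_factor


def max_failures_with_full_replication(datanodes: int, replication_factor: int = 3) -> int:
    """DataNodes that can be down while blocks can still reach full replication."""
    return max(datanodes - replication_factor, 0)


if __name__ == "__main__":
    for nodes in (1, 3, 5):
        spare = max_failures_with_full_replication(nodes)
        print(f"{nodes} DataNode(s): full replication possible="
              f"{can_hold_all_replicas(nodes)}, spare nodes={spare}")
```

With exactly 3 DataNodes, any single failure leaves blocks under-replicated until the node returns; with 5, up to two nodes can be down and HDFS still has somewhere to place a third replica.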

Storm-ZooKeeper transactional logs extremely large

I'm using a ZooKeeper cluster (3 machines) for my Storm cluster (4 machines). The problem is that, because of the topologies deployed on the Storm cluster, the ZooKeeper transactional logs grow extremely large and fill up the ZooKeeper disk. What is really strange is that those logs are not divided into multiple files; instead I have one big transactional file on every ZooKeeper machine, so the autopurge setting in my ZooKeeper configuration has no effect on those files.
Is there a way to solve this problem from zookeeper side, or can I change the way storm uses zookeeper to minimize the size of those logs?
Note: I'm using ZooKeeper 3.6.4 and Storm 0.9.6.
I was able to resolve this problem by using Pacemaker to process heartbeats from workers instead of ZooKeeper. That allowed me to avoid the disk writes ZooKeeper performs to maintain consistency and to use an in-memory store instead. To be able to use Pacemaker, I upgraded to Storm 1.0.2.

How to allocate physical resources for a big data cluster?

I have three servers and I want to deploy a Spark Standalone cluster or a Spark on YARN cluster on those servers.
Now I have some questions about how to allocate physical resources for a big data cluster. For example, I want to know whether I can deploy the Spark Master process and a Spark Worker process on the same node, and why.
Server Details:
CPU Cores: 24
Memory: 128GB
I need your help. Thanks.
Of course you can; just add the Master's host to the slaves file. On my test server I have such a configuration: the master machine is also a worker node, and there is one worker-only node. Everything works fine.
However, be aware that if the worker fails and causes a major problem (e.g. a system restart), you will have a problem, because the master will be affected as well.
Edit:
Some more info after the question edit :) If you are using YARN (as suggested), you can use Dynamic Resource Allocation. Here are some slides about it, and here is an article from MapR. How to configure memory properly for a given case is a very long topic; I think these resources will teach you a lot about it.
BTW, if you have already installed a Hadoop cluster, maybe try YARN mode ;) But that is outside the scope of the question.
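As a rough illustration of the memory-sizing arithmetic for the servers in the question (24 cores, 128GB each), here is a minimal sketch. It is not taken from the answer above: the helper name, the 1 core and 8GB reserved for the OS and daemons, and the 5 cores per executor are all assumed rules of thumb. The 10% overhead fraction mirrors Spark's default executor memory overhead factor.

```python
from dataclasses import dataclass


@dataclass
class ExecutorPlan:
    executors_per_node: int
    cores_per_executor: int
    heap_gb_per_executor: int


def plan_executors(node_cores: int,
                   node_mem_gb: int,
                   reserved_cores: int = 1,       # assumption: left for OS/daemons
                   reserved_mem_gb: int = 8,      # assumption: left for OS/daemons
                   cores_per_executor: int = 5,   # assumption: common rule of thumb
                   overhead_fraction: float = 0.10) -> ExecutorPlan:
    """Split one node's usable cores and memory into equally sized executors."""
    usable_cores = node_cores - reserved_cores
    usable_mem_gb = node_mem_gb - reserved_mem_gb
    executors = usable_cores // cores_per_executor
    if executors < 1:
        raise ValueError("not enough cores for a single executor")
    # Each executor's share must cover heap plus off-heap overhead.
    share_gb = usable_mem_gb / executors
    heap_gb = int(share_gb / (1 + overhead_fraction))
    return ExecutorPlan(executors, cores_per_executor, heap_gb)


if __name__ == "__main__":
    plan = plan_executors(node_cores=24, node_mem_gb=128)
    print(plan)
    print(f"spark-submit --executor-cores {plan.cores_per_executor} "
          f"--executor-memory {plan.heap_gb_per_executor}g ...")
```

Treat the result as a starting point to measure against, not a recommendation; if the Master or other services share a node with a Worker, the reserved values should grow accordingly.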

What are the differences between ZooKeeper, JournalNode tasks and the Quorum Journal Manager in Hadoop?

After studying material on a number of websites and in videos, I am confused about the functionality and purpose of three Hadoop components: ZooKeeper, JournalNode and the Quorum Journal Manager.
Could anyone please explain why each of them was invented, and how the three differ in purpose and functionality?
Thanks in advance.
Think of it like this: ZooKeeper is a group of people, each assigned to watch over a factory and coordinate it; the journal nodes are a place where all the factory managers can check each other's status and coordinate. QJM builds on the journal nodes and is used in HA for better coordination in case of failover.
ZooKeeper coordinates HBase RegionServers and other Hadoop modules that require ZooKeeper.
JournalNodes store the shared edit log that keeps the standby NameNode in sync with the active NameNode.
QJM is the mechanism the NameNodes use to write those edits to a quorum of JournalNodes.
In a core Hadoop setup, of these only the JournalNodes are necessary, and only for a distributed setup with NameNode HA.
Firstly, quorum means that a majority is needed to make decisions. So when you see the word "quorum", you should think of a clustered, that is, multi-host, configuration. You will hear this term for both ZooKeeper and JournalNodes.
Short description of their functionalities will help you distinguish their purpose.
Zookeeper: ZooKeeper is the central synchronisation service for information that applications need to check frequently. An application may need many kinds of information, such as naming structure, configuration information (or simply configurations), etc. The most common case is application configuration. When you change a config that relates to, let's say, 80 servers, you need to develop a synchronisation service to propagate this change to all nodes. The application itself may have this feature. But imagine you add another 12 applications to your environment. You would need to take care of each application's synchronisation service one by one. This is where ZooKeeper comes in. ZooKeeper can handle the management of all this information by itself. If you set it up as a cluster (use an odd number of hosts: decisions need a majority, and an even number adds no extra fault tolerance), you will have high availability for ZooKeeper (failover cases) and have a ZooKeeper quorum.
Journal Node: In a high availability Hadoop cluster you have more than one NameNode running in active/passive mode. The active NameNode informs the JournalNodes of changes. The standby NameNode asks the JournalNodes what has changed. As with ZooKeeper, if you set them up in a cluster configuration (again with an odd number of hosts, for the same majority reason), you have high availability for the JournalNode features as well, and you have a Quorum Journal Manager.
Actually, I have never heard of them being set up on a single host or node except for lab purposes (a VM on a PC).
1. Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications
Role of Zookeeper in Hadoop ecosystem:
During the Hadoop NameNode failover process, ZooKeeper is used to avoid a split-brain scenario, so that the NameNode state does not diverge because of the failover.
Refer to this post for more details:
How does Hadoop Namenode failover process works?
2. JournalNode ( Used in Namenode failover process)
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs).
JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager.
Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine
3. Quorum Journal Manager (QJM)
QJM allows edit logs to be shared between the Active and Standby NameNodes.
Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario
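The two properties described above, writes needing a majority of JournalNodes and only one NameNode being allowed to write, can be illustrated with a toy model. This is a heavily simplified sketch, not Hadoop code: the class and method names are made up, and it leaves out log recovery and everything else the real protocol does. The one idea it borrows is that a writer claims an increasing epoch number and each JournalNode rejects writes from older epochs.

```python
class JournalNode:
    """Toy journal daemon: remembers the newest epoch it has promised to."""

    def __init__(self) -> None:
        self.promised_epoch = 0
        self.edits: list[str] = []
        self.up = True

    def new_epoch(self, epoch: int) -> bool:
        if not self.up or epoch <= self.promised_epoch:
            return False
        self.promised_epoch = epoch
        return True

    def journal(self, epoch: int, edit: str) -> bool:
        # Reject writers whose epoch is older than the one we promised to.
        if not self.up or epoch < self.promised_epoch:
            return False
        self.edits.append(edit)
        return True


class QuorumWriter:
    """Toy stand-in for a NameNode writing edits through a quorum."""

    def __init__(self, nodes: list[JournalNode]) -> None:
        self.nodes = nodes
        self.epoch = 0

    def _is_majority(self, acks: int) -> bool:
        return acks > len(self.nodes) // 2

    def become_active(self, epoch: int) -> None:
        acks = sum(jn.new_epoch(epoch) for jn in self.nodes)
        if not self._is_majority(acks):
            raise RuntimeError("could not claim epoch on a majority")
        self.epoch = epoch

    def write(self, edit: str) -> bool:
        acks = sum(jn.journal(self.epoch, edit) for jn in self.nodes)
        return self._is_majority(acks)


if __name__ == "__main__":
    jns = [JournalNode() for _ in range(3)]
    nn1, nn2 = QuorumWriter(jns), QuorumWriter(jns)
    nn1.become_active(1)
    print("nn1 write:", nn1.write("mkdir /a"))             # majority acks
    jns[2].up = False                                      # one JournalNode fails
    print("nn1 write, one JN down:", nn1.write("mkdir /b"))
    nn2.become_active(2)                                   # failover to nn2
    print("stale nn1 write:", nn1.write("rm /a"))          # fenced off
    print("nn2 write:", nn2.write("mkdir /c"))
```

In this model, three JournalNodes keep accepting edits with one node down, and once the second writer has claimed a newer epoch on a majority, the old writer can no longer reach a majority, which is the split-brain protection the quoted text refers to.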
