HMaster vs Zookeeper - HBase - hadoop

I have been doing a lot of reading about HBase lately and I am little confused as to the role of HMaster and Zookeeper in the architecture of HBase.
When a client requests for data, who gets that request? Assuming this is the first request. I understand subsequent requests can be directly made to region servers. But for that to happen, locations of meta files need to be retrieved and then a get or scan needs to run on the specific meta table in the region server.
The reason I ask is, if I am using Java I would use HConnectionManager class to create a connection. It looks like HConnectionManager already has a cache of region locations available. The reason the cache is built will be when some number of requests are made earlier, but what if the cache isn't there and this is the first request.
Who takes the first HBase request, will it be the zookeeper quorum? We are submitting the hbase-site.xml file for the HBaseConfiguration class.
Also I am a little confused about how do we define a "client"?
The other thing I read was the meta information gets cached on the "client", is this true even in case of HBase REST? Will the client here be the HMaster or the actual user who is making the REST call. If so doesn't it expose a security threat if metadata is exposed to client.

Clients connect to ZooKeeper to get the latest state. The HBaseMaster role is to make sure this list is correct (i.e. assign regions to regionservers on startup, failures etc.). Clients will contact the HBaseMaster only for admin purposes e.g. creating a table, changing its structure etc. (via HBaseAdmin class). You can read more about it here.
In case of HBase REST the client sends REST request to the REST server which holds an HBase client internally

Following answer base on HBase-1.0.1.1:
1.When a client requests for data, who gets that request?
a)look up zookeeper for hbase:meta region location and cache meta region location for future.
b)scan hbase:meta in region server and get region location we need.Client also cache the region location.
c)request the region server.
2.Who takes the first HBase request, will it be the zookeeper quorum?
Yes if all is first,else may be meta region or user table region.
3.security
You can use kerberos.It is supported for HBase.

Related

Can I use a SnappyData JDBC connection with only a Locator and Server nodes?

SnappyData documentation and architecture diagrams seem to indicate that a JDBC thin client connection goes from a client to a Locator and then it is routed to a direct connection to a Server.
If this is true, then I can run JDBC queries without a Lead node, correct?
Yes, that is correct. The locator provides load and connectivity information back to the client that is now able to connect to one or more servers either for direct access to a bucket for low latency queries but more importantly, is HA - can failover and failback.
So, yes, your connected clients will continue to function even when the locator goes away. Note that the "lead" plays a different role than the locator. Its primary function is to host Spark driver, orchestrate Spark Jobs and provide HA to Spark. With no lead, you won't be able to run such Jobs.
In addition to what #jagsr has mentioned, if you do not intend to run the lead nodes (and thus no Spark jobs or column store), then you can run the cluster as pure row store using snappy-start-all.sh rowstore (see rowstore docs)

Does Apache Kafka Store the messages internally in HDFS or Some other File system

We have a project requirement of testing the data at Kafka Layer. So JSON files are moving into hadoop area and kafka is reading the live data in hadoop(Raw Json File). Now I have to test whether the data sent from the other system and read by kafka should be same.
Can i validate the data at kafka?. Does kafka store the messages internally on HDFS?. If yes then is it stored in a file structure similar to what hive saves internally just like a single folder for single table.
Kafka stores data in local files (ie, local file system for each running broker). For those files, Kafka uses its own storage format that is based on a partitioned append-only log abstraction.
The local storage directory, can be configured via parameter log.dir. This configuration happens individually for each broker, ie, each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered-storage, that will allow brokers to no only use local disks, but to offload "cold data" into a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed, is a Kafka internal implementation detail. Thus you should now rely on it. To get the current state of your cluster, you can request meta data about topics and partitions etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for an code example). Also keep in mind, that partitions are replicated and if you write, you always need to write to the partition leader (if you create a KafkaProducer is will automatically find the leader for each partition you write to).
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index
I think you can, but you have to do that manually. You can let kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly one can do the followings:
Assuming you have all servers are running (check the confluent
website)
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: The approach assumes that you are using their platform
(confluent platform) which I haven't use.
Fire the kafka-hdfs streamer.
Also you might find more useful details in this Stack Overflow discussion.
This happens with most of the beginner. Let's first understand that component you see in Big Data processing may not be at all related to Hadoop.
Yarn, MapReduce, HDFS are 3 main core component of Hadoop. Hive, Pig, OOOZIE, SQOOP, HBase etc work on top of Hadoop.
Frameworks like Kafka or Spark are not dependent on Hadoop, they are independent entities. Spark supports Hadoop, like Yarn, can be used for Spark's Cluster mode, HDFS for storage.
Same way Kafka as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
Hope this helps.

Dynamic Partionitioning in TazyGrid Cluster

When adding a new server or removing one, does TazyGrid (Open Source) partitions data at runtime or do we have to pre-plan & do some pre-deployement strategies ?
If some pre-deployment planning is required, can some one please link to any resources?
Yes TayzGrid is completely dynamic and no pre-palnning is required.
Simply add a node to the cluster or remove it & data balancing will be done automatically.
Peer to Peer Dynamic Cluster
Runtime Discovery within Cluster
Runtime Discovery by Clients
Failover Support: Add/Remove Servers at Runtime
and more from TayzGrid website
Data distribution map: The data distribution map is provided only if the caching topology is Partitioned Cache or Partition-Replica Cache. This information is useful for clients to determine the location of data in the cluster. The client can therefore directly access the data from the cache server. The data distribution map is provided in two cases:
(1) at the time of the client’s connection to the server and
(2) if when any change occurs in the partitioning map because a new server has been added or removed from the cluster.

Solutions for a secure distributed cache

Problem: I want to cache user information such that all my applications can read the data quickly, but I want only one specific application to be able to write to this cache.
I am on AWS, so one solution that occurred to me was a version of memcached with two ports: one port that accepts read commands only and one that accepts reads and writes. I could then use security groups to control access.
Since I'm on AWS, if there are solutions that use out-of-the box memcached or redis, that'd be great.
I suggest you use ElastiCache with one open port at 11211(Memcached)then create an EC2 instance, set your security group so only this server can access to your ElastiCache cluster. Use this server to filter your applications, so only one specific application can write to it. You control the access with security group, script or iptable. If you are not using VPC, then you can use cache security group.
I believe you can accomplish this using Redis (instead of Memcached) which is also available via ElastiCache. Once the instance has been created, you will want to create a replication group and associate it to the cache cluster you already launched.
You can then add instances to the replication group. Instances within the replication group are simply replicated from the Master Cache Cluster (single Redis instance) and so are (by default) read-only.
So, in this setup, you have a master node (single endpoint) that you can write to and as many read nodes (multiple endpoints) as you would like.
You can take security a step further and assign different routing rules to the replication group (via the VPC) so the applications reading data does not have access to the master node (the only one that can write data).

HBase NoServerForRegionException?

I am getting this exception when for a while i didn't communicated with HBase:
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region because: Connection refused
is this something related with session expiry, if so, how can i extend session lifetime?
Query bin/hbase hbck and find in which machine root Regionserver is running..
You should get -ROOT- is okay on hbck. Make sure that all your
Regionserver is up and running.
use start regionserver for starting regionserver
I don't think this has anything to do with session lifetime.
Check your cluster to make sure that it is up and working correctly and all region servers are alive. Then check the logs to make sure that they are not reporting some error state.
HBase is complex software -- without more detailed information it is very difficult to diagnose what is going on. And often you can discover the problem by collecting the more detailed information.
This error shows that the client is not able to talk to Region server.
Check the region server associated with the region its trying to connect and check its up.
To identify the region server associated with the region please go through http://hbase.apache.org/0.94/book/regions.arch.html#regions.arch.assignment
Some factors have played a role here.
Please note the below steps which occur when you try to connect to Hbase from a client,
Hbase connects to Zookeeper to get the Ip of the regionservers which host the ROOT table.
The client caches this information about the IP's so that it doesnt have to contact the zookeeper again.
Your problem is that, your client is trying to connect to the zookeeper to get the IP. one of the below things may be going wrong,
Your client is not able to connect to the zookeeper.
The information about the ROOT contained inside the Znode in ZooKeeper is wrong.
Possible fixes.
Check if your zookeeper is working fine.
Delete the Znode for Hbase in your Zookeeper and restart the cluster. Don't worry, this wont delete your data.
Once this is achieved? the client can get the ROOT information and then query for the META table without any issue.

Resources