Ways to enable communication between Kerberized and non-Kerberized environments - Hadoop

We have a non-Kerberized Hortonworks cluster which needs to access services in a Kerberized Cloudera cluster.
What are the ways in which the non-Kerberized cluster can communicate with the Kerberized cluster?
Can we:
1. Configure the KDC in the Kerberized cluster to be the common KDC?
2. Kerberize the Hortonworks cluster by installing and configuring Kerberos, creating SPNs and UPNs, etc.?

"Which are the ways in which the non Kerberized cluster can communicate with the kerberized cluster" generally there are none (with exceptions -see below) .. once you kerberize a cluster, it becomes a "secure" cluster that would require Kerberos authentication to talk to many of that cluster resources. If another (source) cluster where you're making requests is not kerberized, it'll not even have a kerberos ticket to authenticate in the other cluster.
Although certain services can control authentication separately. For HBase, those are hbase.security.authentication and hbase.rest.authentication.type (each one can be simple or kerberos). Which ones you're trying to use? Hive doesn't have an equivalent for these HBase settings. Solr does, for example, see "Solr Secure Authentication". Etc (I didn't go through all services)
So certain things can be relaxed for authentication, but it would be for all access not just from that non-kerberized cluster.
What you're looking for might need custom application in between if you'd like to have access from a non-secure to a secure cluster.
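For illustration, here is a hedged sketch of the two HBase settings mentioned above, relaxed to simple. Keep in mind this removes Kerberos checks for all clients, not just the other cluster:

```bash
# Hypothetical sketch: the two HBase settings named above, relaxed to "simple".
# These <property> elements belong inside the <configuration> element of
# hbase-site.xml; the heredoc below only prints them so the snippet is copy-pasteable.
cat <<'EOF'
<property>
  <name>hbase.security.authentication</name>
  <value>simple</value>  <!-- "kerberos" to require Kerberos -->
</property>
<property>
  <name>hbase.rest.authentication.type</name>
  <value>simple</value>
</property>
EOF
```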

If you do not use self-written applications that are able to deal with different configuration properties, I do not see a way to set up communication between the two clusters. As @Tagar already mentioned, it will be necessary to maintain a similar configuration.

Related

Adding HBase service in Kerberos-enabled CDH cluster

I have a CDH cluster already running with Kerberos authentication.
I have a requirement to add the HBase service to the running cluster.
I'm looking for documentation on enabling the HBase service, since the cluster is Kerberos enabled. Both command line and GUI options are welcome.
Also, it would be good if there were a testing method, such as steps to create a small table.
Thanks in advance!
If you add it through the Cloudera Manager "Add Service" wizard, CDH takes care of it automatically (creating/distributing Kerberos keytabs and adding the services).
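As for testing, a minimal smoke test might look like the following (the keytab path and principal are placeholders for your realm):

```bash
# Hypothetical smoke test: authenticate, then create, write to, and scan a tiny table.
# The keytab path and principal are placeholders; use ones valid in your environment.
kinit -kt /path/to/user.keytab user@EXAMPLE.COM

hbase shell <<'EOF'
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:col1', 'value1'
scan 'smoke_test'
disable 'smoke_test'
drop 'smoke_test'
EOF
```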

Does Hadoop itself contain fault-tolerance failover functionality?

I just installed the new version of Hadoop 2. I wish to know: if I configure a Hadoop cluster and it's brought up, how can I know whether data transmission has failed and there's a need for failover?
Do I have to install other components like ZooKeeper to track/enable any HA events?
Thanks!
High Availability is not enabled by default. I would highly encourage you to read the Hadoop documentation from Apache (http://hadoop.apache.org/). It gives an overview of the architecture and services that run on a Hadoop cluster.
ZooKeeper is required by many Hadoop services to coordinate their actions across the entire cluster, regardless of whether the cluster is HA or not. More information can be found in the Apache ZooKeeper documentation (http://zookeeper.apache.org/).
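For reference, a hedged sketch of the core NameNode HA properties involved; the nameservice, NameNode IDs, and ZooKeeper hosts are placeholders, and the Apache docs linked above describe the full set:

```bash
# Hypothetical sketch of the core NameNode HA properties.
# "mycluster", nn1/nn2, and the ZooKeeper hosts are placeholders.
# The dfs.* properties go in hdfs-site.xml; ha.zookeeper.quorum goes in core-site.xml.
cat <<'EOF'
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
EOF

# Once HA is configured, you can ask which NameNode is currently active:
hdfs haadmin -getServiceState nn1
```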

Spark on Mesos reading from a secured HDFS cluster

We have deployed Spark on Mesos and we are having problems when we are trying to read from HDFS.
Out of the box we tried using Kerberos, which works fine in local mode but fails when running on Mesos (even though we made sure that each machine had a valid token).
My question is, what alternatives do we have?
According to this document, the only two options are Kerberos and a shared secret, and of those, Kerberos only works on YARN.
I haven't found any article on how a shared secret could be set up for HDFS.

How to integrate Hadoop with Ambari without HDP?

I have a Hadoop cluster with Apache Hadoop 2.0.7.
I want to know how to integrate Ambari with Apache Hadoop without HDP (Hortonworks).
Actually, if I use HDP the solution is easy, but I don't want to use it in my situation.
Do you have any ideas?
Ambari relies on 'Stack' definitions to describe what services the Hadoop cluster consists of. Hortonworks defined a custom Ambari stack; it's called HDP.
You could define your own stack and use any services and respective versions that you want; see the Ambari wiki for more information about defining stacks and services, and the sketch below for what a stack definition's skeleton looks like.
That being said, I don't think it's possible to use your pre-existing installation of Hadoop with Ambari. Ambari is used to provision and manage Hadoop clusters. It keeps track of the state of each of its stack's services and the state of each service's components. Since your cluster is already provisioned, it would be difficult (maybe impossible) to add it to an Ambari instance.
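To make the stack idea concrete, here is a hedged sketch of the directory skeleton a custom stack definition uses; "MYSTACK" and "MYSERVICE" are placeholder names, and the Ambari wiki documents the exact layout:

```bash
# Hypothetical skeleton of a custom Ambari stack definition.
# "MYSTACK", "1.0", and "MYSERVICE" are placeholders; Ambari reads stack
# definitions from /var/lib/ambari-server/resources/stacks on the Ambari server.
STACKS=/var/lib/ambari-server/resources/stacks
mkdir -p "$STACKS/MYSTACK/1.0/repos"
mkdir -p "$STACKS/MYSTACK/1.0/services/MYSERVICE/package/scripts"

# Each stack version and each service carries a metainfo.xml describing
# its version and components; repoinfo.xml lists package repositories.
touch "$STACKS/MYSTACK/1.0/metainfo.xml"
touch "$STACKS/MYSTACK/1.0/repos/repoinfo.xml"
touch "$STACKS/MYSTACK/1.0/services/MYSERVICE/metainfo.xml"
```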

AWS EMR Hadoop Administration

We are currently using Apache Hadoop (vanilla version) in our org. We are planning to migrate to AWS EMR. I'm trying to understand how AWS EMR Hadoop works internally (not how to use it); I'm mainly interested in the Hadoop administration steps, how master and slaves communicate, and the various configurations. I already checked the AWS EMR documentation, but I don't see a detailed comparison.
Can someone recommend a link/tutorial for migrating to AWS EMR from Apache Hadoop?
During EMR cluster creation, it will ask you to specify the master and the nodes; the default settings provision one master and two nodes for you. You can also specify which applications you want in the cluster (e.g., Hadoop, Hive, Spark, Zeppelin, Hue, etc.).
Once the cluster is created, it provisions all the services. You can click on these services and access them via the web, or SSH into the master. For example, to access the Hue interface, go to the service within EMR and click it; a new window will open with the Hue web interface.
Installing these applications is very easy: all you have to do is specify the services during cluster creation.
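As a hedged illustration, creating such a cluster from the AWS CLI looks roughly like this; the cluster name, release label, instance type/count, and key pair are placeholders:

```bash
# Hypothetical example: create a small EMR cluster with a few applications.
# The release label, instance type/count, and key name are placeholders.
aws emr create-cluster \
  --name "migration-test" \
  --release-label emr-5.30.0 \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Hue \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair
```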
Amazon Elastic MapReduce uses a mostly standard implementation of Hadoop and associated tools.
See: AMI Versions Supported in Amazon EMR
The benefits of using EMR are in the automated deployment of instances. For example, launching a cluster with an appropriate AMI means that software is already loaded on each instance and HDFS is configured across the core nodes.
The Master and Slave (Core/Task) nodes communicate in exactly the normal way that they communicate in any Hadoop cluster. However, only one Master is supported (with no backup Master).
When migrating to EMR, check that you are using compatible versions of software (eg Hadoop, Hive, Pig, Impala, etc). Also consider using Amazon S3 for storage of data instead of HDFS, especially for storing source data, since data on S3 persists even after the EMR cluster is terminated.
Technically, the Hadoop provided with EMR can be a few releases behind. You should check the EMR release notes for the detailed application versions provided with each EMR release. EMR takes care of application provisioning, setup, and configuration. Based on the EC2 instance type, the Hadoop (and other application) configuration defaults will change. You can override the default settings by supplying application configurations, as sketched below.
Other than this, the Hadoop you have on premises and EMR's should be the same.
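A hedged sketch of overriding a default via a configuration classification at cluster creation; the property and value below are placeholders chosen only to show the shape of the JSON:

```bash
# Hypothetical example: override a yarn-site default at cluster creation.
# "yarn-site" is an EMR configuration classification; the property and
# value here are placeholders to show the shape of the JSON.
cat > configurations.json <<'EOF'
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]
EOF

aws emr create-cluster \
  --name "tuned-cluster" \
  --release-label emr-5.30.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://configurations.json
```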
