Hadoop access without SSH

Is there a way to allow a developer to access a Hadoop command line without SSH? I would like to place some Hadoop clusters in an environment where SSH is not permitted. I have searched for alternatives such as a desktop client, but so far I have not seen anything. I will also need to federate sign-on info for developers.

If you're asking about hadoop fs and similar commands, you don't need SSH for this.
You just need to download the Hadoop clients and configure core-site.xml and hdfs-site.xml to point at the remote cluster. However, this is an administrative security hole, so setting up an edge node that does have trusted and audited SSH access is preferred.
Similarly, Hive, HBase, or Spark jobs can be run with the appropriate clients or configuration files without any SSH access, just local libraries.
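For example, a minimal sketch assuming the client is installed locally; the NameNode hostname and port below are placeholders (8020 is a common default RPC port):
export HADOOP_CONF_DIR=/path/to/remote-cluster-conf # directory holding core-site.xml and hdfs-site.xml copied from the cluster
hadoop fs -ls /user/alice # uses fs.defaultFS from that configuration
hadoop fs -ls hdfs://namenode.example.com:8020/tmp # or address the remote NameNode explicitly with a full URI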

You don't need SSH to use Hadoop. Also, Hadoop is a combination of different stacks; which part of Hadoop are you referring to specifically? If you are talking about HDFS, you can use WebHDFS. If you are talking about YARN, you can use its REST API. There are also various UI tools such as Hue you can use. Notebook apps such as Zeppelin or Jupyter can also be helpful.
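As a rough sketch of the REST route (hostnames are placeholders; WebHDFS listens on the NameNode HTTP port, typically 9870 on Hadoop 3.x or 50070 on 2.x, and the YARN ResourceManager API is on 8088 by default):
curl "http://namenode.example.com:9870/webhdfs/v1/tmp?op=LISTSTATUS" # list /tmp over WebHDFS
curl "http://resourcemanager.example.com:8088/ws/v1/cluster/apps" # list YARN applications via the REST API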

Related

Why would I Kerberise my hadoop (HDP) cluster if it already uses AD/LDAP?

I have an HDP cluster.
This cluster is configured to use Active Directory as the authentication and authorization authority. To be more specific, we use Ranger to limit access to HDFS directories, Hive tables and YARN queues after said user has provided the correct username/password combination.
I have been tasked to Kerberise the cluster, which is very easy thanks to the "press buttons and skip"-like option in Ambari.
We Kerberised a test cluster. While interacting with Hive does not require any modification to our existing scripts on the cluster's machines, it is very, very difficult to find a way for end users to interact with Hive from OUTSIDE the cluster (Power BI, DbVisualizer, PHP application).
Kerberising seems to bring an unnecessary amount of work.
What concrete benefits would I get from Kerberising the cluster (except making the guys above in the hierarchy happy because, hey, we Kerberised, yoohoo)?
Edit:
One benefit:
Kerberising the cluster grants more security: the cluster runs on Linux machines, and the company's Active Directory is not able to handle that OS on its own.
Ranger with AD/LDAP authentication and authorization is ok for external users, but AFAIK, it will not secure machine-to-machine or command-line interactions.
I'm not sure if it still applies, but on a Cloudera cluster without Kerberos, you could fake a login by setting the HADOOP_USER_NAME environment variable on the command line:
sh-4.1$ whoami
ali
sh-4.1$ hadoop fs -ls /tmp/hive/zeppelin
ls: Permission denied: user=ali, access=READ_EXECUTE, inode="/tmp/hive/zeppelin":zeppelin:hdfs:drwx------
sh-4.1$ export HADOOP_USER_NAME=hdfs
sh-4.1$ hadoop fs -ls /tmp/hive/zeppelin
Found 4 items
drwx------ - zeppelin hdfs 0 2015-09-26 17:51 /tmp/hive/zeppelin/037f5062-56ba-4efc-b438-6f349cab51e4
For machine-to-machine communications, tools like Storm, Kafka, Solr or Spark are not secured by Ranger, but they are secured by Kerberos, so only dedicated processes can use those services.
Source: https://community.cloudera.com/t5/Support-Questions/Kerberos-AD-LDAP-and-Ranger/td-p/96755
Update: Apparently, Kafka and Solr Integration has been implemented in Ranger since then.
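To illustrate the difference (a sketch; the principal and realm are made up): once the cluster is Kerberised, the user identity comes from the Kerberos ticket rather than from HADOOP_USER_NAME, so the spoof above stops working and commands fail until you authenticate:
export HADOOP_USER_NAME=hdfs # ignored once Hadoop security is enabled
hadoop fs -ls /tmp/hive/zeppelin # fails with a GSS/SASL authentication error without a valid ticket
kinit ali@EXAMPLE.COM # authenticate as yourself; Ranger/HDFS permissions are then enforced for 'ali'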

How to login to real time company hadoop cluster?

I am new to the Hadoop environment. I joined a company and was given KT and the required documents for the project. They asked me to log in to the cluster and start work immediately. Can anyone suggest the steps to log in?
Not really clear what you're logging into. You should ask your coworkers for advice.
However, it sounds like you have a Kerberos keytab, and you would run
kinit -kt key.kt
There might be additional arguments necessary there, such as what's referred to as a principal, but only the cluster administrators can answer what that needs to be.
To verify your ticket is active
klist
Usually you will have edge nodes, i.e. client nodes, installed with all the clients like
HDFS Client
Sqoop Client
Hive Client etc.
You need to get the hostnames/IP addresses for these machines. If you are using Windows, you can use PuTTY to log in to these nodes, either with a username and password or with the .ppk file provided for those nodes.
Any company, in my view, will have an infrastructure team which configures LDAP with the Hadoop cluster and grants access by adding your ID to the group roles.
And by the way, are you using Cloudera/MapR/Hortonworks? Every distribution has its own way and best practices.
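For instance, once you have the edge-node hostname and credentials (the hostname below is hypothetical), you can log in from a terminal and check that the expected clients are on the PATH:
ssh your_user@edge01.example.com # or connect with PuTTY from Windows
which hdfs hive sqoop # confirm the client binaries are installed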
I am assuming KT means knowledge transfer. Also the project document is about the application and not the Hadoop Cluster/Infra.
I would follow the following procedure:
1) Find out the name of the edge node (also called client node) from your team or your TechOps. Also find out whether you will be using some generic Linux user (say "develteam") or you would have to get a user created on the edge node.
2) Assuming you are accessing from Windows, install an SSH client (like PuTTY).
3) Log in to the edge node using the credentials (for the generic or specific user as in #1).
4) Run the following command to check that you are on a Hadoop cluster:
> hadoop version
5) Try the Hive shell by typing:
> hive
6) Try running the following HDFS command:
> hdfs dfs -ls /
7) Ask a team member where to find the Hadoop config for that cluster. You will most probably not have write permissions, but you can cat the following files to get an idea of the cluster (see the example after the list):
core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml
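A minimal sketch for that last step (the config directory is an assumption; /etc/hadoop/conf is a common default, but your cluster may differ):
grep -A1 fs.defaultFS /etc/hadoop/conf/core-site.xml # shows the NameNode address (the value is usually on the next line)
grep -A1 yarn.resourcemanager.hostname /etc/hadoop/conf/yarn-site.xml # shows the ResourceManager host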

how users should work with ambari cluster

My question is pretty trivial, but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase and HDFS (among other things).
I don't understand how a user who wants to use that cluster is supposed to use it.
For example, a user wants to copy a file to HDFS, run a spark-shell, or create a new table in the HBase shell.
Should he get a local account on the server that runs the corresponding service? Shouldn't he use a 3rd-party machine (his own laptop, for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP like spark-shell has.
What is the normal/right/expected way to run all these tasks from a user's perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use, e.g. HDFS, Spark, HBase et cetera.
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
In my case, I decided to install the clients on all servers during that provisioning step.
Afterwards, one way to figure out which servers have the required clients installed is to check the Hosts view in Ambari, which lists the installed clients for each host.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, using a service from a client is independent of which server the service is actually running on.
Second, make sure that you are compliant with the security mechanisms of your cluster. In relation to HDFS, this could influence which users you are allowed to use and which directories you can access by using them. If you do not use security mechanisms like e.g. Kerberos, Ranger and so on, you should be able to directly run your stated tasks from the command line.
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
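To address the "no way to specify the server IP" point from the question: the HDFS client can also be pointed at a specific NameNode explicitly (hostname and port below are placeholders; 8020 is a common default RPC port):
hdfs dfs -fs hdfs://namenode.example.com:8020 -ls /tmp # the generic -fs option overrides fs.defaultFS for this command
hdfs dfs -ls hdfs://namenode.example.com:8020/tmp # or embed the NameNode address in the path URI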
Take a look at Ambari Views, especially the Files view, which allows browsing HDFS.

What does "Client" exactly mean for Hadoop / HDFS?

I understand the general concept behind it, but I would like more clarification and a clear-cut definition of what a "client" is.
For example, if I just write an hdfs command on the Terminal, is it still a "client" ?
"Client" in Hadoop refers to the interface used to communicate with the Hadoop filesystem. There are different types of clients available with Hadoop to perform different tasks.
The basic filesystem client hdfs dfs is used to connect to a Hadoop Filesystem and perform basic file related tasks. It uses the ClientProtocol to communicate with a NameNode daemon, and connects directly to DataNodes to read/write block data.
To perform administrative tasks on HDFS, there is hdfs dfsadmin. For HA related tasks, hdfs haadmin.
There are similar clients available for performing YARN related tasks.
These clients can be invoked using their respective CLI commands from a node where Hadoop is installed and which has the necessary configurations and libraries required to connect to a Hadoop filesystem. Such nodes are often referred to as Hadoop clients.
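For instance, a quick sketch (the admin commands require the appropriate HDFS/YARN privileges):
hdfs dfs -ls /user # filesystem client: list a directory
hdfs dfsadmin -report # admin client: report cluster capacity and DataNode status
yarn application -list # YARN client: list running applications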
For example, if I just write an hdfs command on the Terminal, is it still a "client"?
Technically, Yes. If you are able to access the FS using the hdfs command, then the node has the configurations and libraries required to be a Hadoop Client.
PS: APIs are also available to create these Clients programmatically.
Edge nodes are the interface between the Hadoop cluster and the outside network. This node/host will have all the libraries and client components present, as well as the current configuration of the cluster needed to connect to HDFS.
This thread discusses the same topic.

No passwd entry for user 'hdfs'

I am trying to set up a Hive environment on my Google Compute Engine Hadoop cluster, which was deployed from the one-click deployment.
When I try to switch to the hdfs user (su hdfs), I get the below error message.
No passwd entry for user 'hdfs'
The "one-click deployment" is an older sample which perhaps showcases installation from shell scripts and tarballs, but isn't intended for use as a supported Hadoop service, and doesn't set up typical Hadoop installation configurations like an hdfs user or adding commands to /usr/bin.
If you want a more Hadoop (and Pig+Hive+Spark) specialized service, you may want to consider using Google Cloud Dataproc, which is Google's managed Hadoop solution. You can create clusters from the cloud console UI in Dataproc just like click-to-deploy, and you'll get a more fully installed Hadoop/Hive environment, including a per-cluster persistent MySQL-based Hive metastore which is shared with SparkSQL to make it easy to play with Spark without modifying your Hive environment if you so choose.
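A minimal sketch of that route (region, zone and cluster name are placeholders; check the current gcloud documentation for the flags required in your environment):
gcloud dataproc clusters create my-cluster --region=us-central1 # create a managed Hadoop/Hive/Spark cluster
gcloud compute ssh my-cluster-m --zone=us-central1-a # SSH to the master node, where the hive and hdfs clients are preinstalled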
