How to log in to a real-world company Hadoop cluster? - hadoop

I am new to the Hadoop environment. I recently joined a company and was given KT and the required documents for the project. They asked me to log in to the cluster and start work immediately. Can anyone suggest the steps to log in?

Not really clear what you're logging into. You should ask your coworkers for advice.
However, it sounds like you have a Kerberos keytab, in which case you would run
kinit -kt key.kt
There might be additional arguments necessary there, such as what's referred to as a principal, but only the cluster administrators can tell you what that needs to be.
To verify that your ticket is active, run
klist
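For illustration only, here is a hedged sketch of the full sequence, assuming a hypothetical principal jdoe@EXAMPLE.COM (your administrators will give you the real principal and keytab name):
kinit -kt key.kt jdoe@EXAMPLE.COM
klist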

Usually you will have edge nodes, i.e. client nodes, installed with all the clients, such as:
HDFS client
Sqoop client
Hive client, etc.
You need to get the hostnames/IP addresses of these machines. If you are on Windows, you can use PuTTY to log in to these nodes, either with a username and password or with the .ppk file provided for those nodes.
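For example, from a Linux/macOS machine a plain SSH login to such a node might look like this (the username and hostname below are placeholders; use whatever your infrastructure team provides):
ssh your_user@edgenode01.example.com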
In my view, any company will have an infrastructure team that integrates LDAP with the Hadoop cluster and grants users access by adding their IDs to the appropriate group roles.
And by the way, are you using Cloudera, MapR, or Hortonworks? Every distribution has its own way of doing things and its own best practices.

I am assuming KT means knowledge transfer, and that the project document is about the application rather than the Hadoop cluster/infrastructure.
I would follow this procedure:
1) Find out the name of the edge node (also called the client node) from your team or your TechOps group. Also find out whether you will be using a generic Linux user (say "develteam") or whether a user needs to be created for you on the edge node.
2) Assuming you are connecting from Windows, install an SSH client (such as PuTTY).
3) Log in to the edge node using the credentials (for the generic or specific user from step 1).
4) Run the following command to check that you are on a Hadoop cluster:
> hadoop version
5) Try the Hive shell by typing:
> hive
6) Try running the following HDFS command:
> hdfs dfs -ls /
7) Ask a team member where to find the Hadoop configuration for that cluster. You will most probably not have write permissions, but you may be able to cat the following files to get an idea of the cluster (see the sketch after this list):
core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml
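As a hedged sketch, assuming the configs live under the common default path /etc/hadoop/conf (your cluster may keep them elsewhere):
cat /etc/hadoop/conf/core-site.xml
cat /etc/hadoop/conf/hdfs-site.xml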

Related

hadoop access without ssh

Is there a way to allow a developer to access a Hadoop command line without SSH? I would like to place some Hadoop clusters in a specific environment where SSH is not permitted. I have searched for alternatives such as a desktop client, but so far I have not seen anything. I will also need to federate sign-on info for developers.
If you're asking about hadoop fs and similar commands, you don't need SSH for this.
You just need to download the Hadoop clients and configure the client-side core-site.xml/hdfs-site.xml files to point at the remote cluster. However, this is an administrative security hole, so setting up an edge node that does have trusted and audited SSH access is preferred.
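As an illustration (the NameNode host and port below are placeholders), a locally installed client can also address the remote cluster directly with a full HDFS URI, with no SSH session involved:
hadoop fs -ls hdfs://namenode.example.com:8020/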
Similarly, Hive, HBase, or Spark jobs can be run with the appropriate clients or configuration files and no SSH access, using just local libraries.
You don't need SSH to use Hadoop. Also, Hadoop is a combination of different stacks; which part of Hadoop are you referring to specifically? If you are talking about HDFS, you can use WebHDFS. If you are talking about YARN, you can use its REST API. There are also various UI tools such as Hue, and notebook apps such as Zeppelin or Jupyter can be helpful too.
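For example, a minimal WebHDFS sketch over plain HTTP (the hostname is a placeholder; the default HTTP port is typically 9870 on Hadoop 3.x and 50070 on 2.x):
curl -s "http://namenode.example.com:9870/webhdfs/v1/tmp?op=LISTSTATUS"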

how users should work with ambari cluster

My question is pretty trivial, but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase, and HDFS (among other things).
I don't understand how a user who wants to use that cluster actually uses it.
For example, a user wants to copy a file to HDFS, run a spark-shell, or create a new table in the HBase shell.
Should he get a local account on the server that runs the corresponding service? Shouldn't he use a third-party machine (his own laptop, for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP the way spark-shell allows.
What is the normal/right/expected way to run all these tasks from a user's perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use, e.g. HDFS, Spark, HBase et cetera.
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
In my case, I decided to install the clients on all servers.
Afterwards, one way to figure out which servers have the required clients installed is to check the Hosts view in Ambari, which lists the clients installed on each host.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear: a client can use a service regardless of which server the service is actually running on.
Second, make sure that you comply with the security mechanisms of your cluster. For HDFS, this can influence which users you are allowed to act as and which directories you can access with them. If you do not use security mechanisms such as Kerberos, Ranger, and so on, you should be able to run your stated tasks directly from the command line.
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
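Building on that, a hedged sketch of the other tasks from the question, run from the same client host (the file name is a placeholder):
hdfs dfs -put myfile.txt /tmp/   # copy a local file into HDFS
spark-shell --master yarn        # start a Spark shell against the cluster's YARN
hbase shell                      # open the HBase shell, e.g. to create a table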
Take a look at Ambari Views, especially the Files view, which allows browsing HDFS.

Check permission in HDFS

I'm totally new to Hadoop. One of our SAS users has a problem saving a file from SAS Enterprise Guide to Hadoop, and I've been asked to check whether the permissions in HDFS have been granted properly, i.e. to make sure users are allowed to move files from one side and add them on the other.
Where should I check for this on the SAS servers? Is it a file, and how can I check it?
An answer with details would be much appreciated.
Thanks.
This question is too vague, but I can offer a few suggestions. First off, the SAS Enterprise Guide user should have a resulting SAS log from his job with any errors.
The Hadoop cluster distribution, version, services being used (for example, the Knox, Sentry, or Ranger security products must be set up), and authentication (Kerberos) all make a difference. I will assume you are not having Kerberos issues, are not running Knox, Sentry, Ranger, etc., and are using core Hadoop with no Kerberos. If you need help with those, you must be more specific.
1. You have to check permissions on the Hadoop side for this. You have to know where they are putting the data in Hadoop; these are paths in HDFS, not the server's file system.
If connecting to Hive and not specifying any options, it is likely /user/hive/warehouse or the /user/username folder.
2. The Hadoop sticky bit, enabled by default, prevents users from writing to /tmp in HDFS. Some SAS programs write to the /tmp folder in HDFS to save metadata, along with other information.
Run the following command on a Hadoop node to check basic permissions in HDFS.
hadoop fs -ls /
You should see the /tmp folder along with its permissions. If the /tmp folder has a "t" at the end, such as drwxrwxrwt, the sticky bit is set. If the permissions are drwxrwxrwx, then the sticky bit isn't set, which rules out that class of permissions issue.
If you have the sticky bit set on /tmp, which is usually the default, then you must either remove it or set an HDFS temp directory in the SAS program's LIBNAME statement for the Hadoop cluster.
Please see the SAS/ACCESS to Hadoop guide about the LIBNAME options: SAS/ACCESS® 9.4 for Relational Databases: Reference, Ninth Edition | LIBNAME Statement Specifics for Hadoop.
To remove/change the Hadoop sticky bit, see the following article or check with your Hadoop vendor: Configuring Hadoop Security in CDH 5, Step 14: Set the Sticky Bit on HDFS Directories. You will want to do the opposite of that article to remove the sticky bit, though.
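As a hedged illustration using octal modes (run as the HDFS superuser; behaviour can vary by distribution, so double-check afterwards with hadoop fs -ls /):
hadoop fs -chmod 0777 /tmp   # clear the sticky bit on /tmp
hadoop fs -chmod 1777 /tmp   # set it back to the usual default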
3. SAS + Authentication + Users
If your Hadoop cluster is secured with Kerberos, then each SAS user must have a valid Kerberos ticket to talk to any Hadoop service. There are a number of guides on the SAS Hadoop support page about Kerberos, along with other resources. With Kerberos they need a Kerberos ticket, not a username or password.
SAS 9.4 Support For Hadoop Reference
If you are not using Kerberos, then you have either the Hadoop default of no authentication, or possibly some services, such as Hive, with LDAP enabled.
If you don't have LDAP enabled, then you can use any Hadoop username in the LIBNAME statement to connect, such as hive, hdfs, or yarn. You do not need to enter any password, and this user doesn't have to be the SAS user account, because the default Hadoop configuration does not require authentication. You can use another account, such as one you might create for the SAS user in your Hadoop cluster. If you do this, you must create a /user/username folder in HDFS by running something like the following as the HDFS superuser (or another user with the right permissions in Hadoop), and then set the ownership to the user.
hadoop fs -mkdir /user/sasdemo
hadoop fs -chown sasdemo:sasusers /user/sasdemo
Then you can check to make sure it exists with
hadoop fs -ls /user/
Basically, whichever user they have in the LIBNAME statement in their SAS program must have a user home folder in Hadoop. The built-in Hadoop users will have one created by default on install, but you will need to create folders for any additional users.
If you are using LDAP with Hadoop (not too common, from what I've seen), then you will have to put the LDAP username along with a password for the user account in the LIBNAME statement. I believe you can encode the password if you like.
Testing Connections to Hadoop from SAS Program
You can modify the following SAS code to do a basic test: put one of the SASHELP datasets into Hadoop using a serial connection to HiveServer2 from SAS Enterprise Guide. This is only a very basic test, but it should prove you can write to Hadoop.
libname myhive hadoop server=hiveserver.example.com port=10000 schema=default user=hive;
data myhive.cars; set sashelp.cars; run;
Then, if you want, you can use the Hadoop client of your choice to find the data in Hadoop in the location where you stored it, likely /user/hive/warehouse.
hadoop fs -ls /user/hive/warehouse
And/or you should be able to run a PROC CONTENTS in SAS Enterprise Guide to display the contents of the Hive table you just put into Hadoop.
PROC CONTENTS DATA=myhive.cars; run;
Hope this helps, good luck!
To find the proper groups that can access files in HDFS, check Sentry.
The file ACLs are defined in Sentry, so if you want to grant or revoke access for anyone, it can be done there; Sentry shows each file location alongside the ACLs of the groups that can access it.

Hadoop User Addition in Secured Cluster

We are using a kerberized CDH cluster. When adding a user to the cluster, we used to add the user only to the gateway/edge nodes, as in any Hadoop distro cluster. But with the newly added user IDs we are not able to execute MapReduce/YARN jobs; they throw a "user not found" exception.
When I researched this, I came across the link https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/SecureContainer.html , which says that to execute YARN jobs in a secured cluster we may need the corresponding user on all the nodes, since secure containers execute under the credentials of the job user.
So we added the corresponding user ID to all the nodes, and the jobs are now getting executed.
If this is the case, and the cluster has around 100+ nodes, user provisioning for each user ID becomes a tedious job.
Can anyone suggest a more effective way, if you have come across the same scenario in your project implementation?
There are several approaches ordered by difficulty (from simple to painful).
One is to have a job-runner user that everyone uses to run jobs.
Another is to use a configuration management tool (Chef, Puppet) or a cron job to sync /etc/passwd and /etc/group across your cluster at regular intervals (1 hour to 1 day).
Otherwise you can buy or use open-source Linux/UNIX user-mapping services such as Centrify (commercial), VAS (commercial), FreeIPA (free), or SSSD (free).
If you have an Active Directory or LDAP server, use Hadoop's LDAP group mappings.
References:
https://community.hortonworks.com/questions/57394/what-are-the-best-practises-for-unix-user-mapping.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cm_sg_ldap_grp_mappings.html
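Whichever approach you choose, one hedged way to sanity-check the result is to ask the NameNode which groups it resolves for a given user (the username below is a placeholder):
hdfs groups jdoe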

hadoop single cluster user

I am reading this document here:
http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
It has this item:
Make the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
It is not clear to me what <username> here should be.
Is this the dedicated Linux user which I created for Hadoop, or something else?
I am a beginner at Hadoop; I just installed it today
and I am just trying to play with a few basic examples.
Short Answer: It doesn't have to be any particular username; it's just whatever you choose to call the directory in HDFS where you want to put your output. But using /user/<username> is the convention and good practice.
Long-Winded Answer:
Peter, think of the "Hadoop username" merely as a way to keep your stuff in HDFS distinct from that of anyone else who's also using the same Hadoop cluster. It's really just the name of a directory that you're creating or using under /user in HDFS. You don't necessarily have to "log in" to use Hadoop, but very often the Hadoop username just mimics your standard username/profile.
For example, at my previous employer, everyone's logins (for email address, chat client, accessing applications, connecting to servers, developing code, etc. -- pretty much anything at work that ever required a username & password) were in the format <firstname.lastname>, so we'd log in to everything that way. Most of us had execution privileges on our grid, so we would ssh to an appropriate server (e.g. $ ssh trevor.allen@server-of-awesomeness), where we had permission to execute MapReduce jobs on the grid. Just as my user was always first.last on my own machine, as well as on all our Linux servers (e.g. home in /home/trevor.allen/), we would follow this precedent in HDFS as well, pointing any output in HDFS to /user/first.last. Of course, since the "username" was arbitrary (really just the name of a directory), you'd occasionally see typos (/user/john.deo), someone mixed up between Linux's usr convention and Hadoop's user convention (/user/john.doe vs /usr/john.doe), random dropping of last names (/user/john), and so on.
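If you want to mirror that convention yourself, a minimal hedged sketch (run on the cluster, assuming your HDFS user simply mirrors your OS login):
hdfs dfs -mkdir -p /user/$(whoami)
hdfs dfs -ls /user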
Hope that helps!
The username corresponds to a user in HDFS, so here you can create the same user as your Linux account, or a different one. For example, if you install Hive, Spark, or HBase, you will have to create their directories in order to run those services.
The username here is the one you use to log in to Hadoop; by default it's your user account name.
