Ranger syncing with unix users - hadoop

I have created two empty groups on two different nodes of my cluster, just one on each node. My Ranger service uses unix user synchronization, but when I restart the Ranger service I can't see the groups I added on the cluster nodes in the Ranger UI. I use HDP 2.5. How do I sync Ranger with unix users?

As you try to sync users, you already seem to understand that there are users for the os, and users for the hadoop platform.
Typically OS users are admin/ops people who need to manage the environment, while most platform users are engineers, analysts, and others who want to do something on the platform. This large group of users is what you typically want to sync.
As already indicated by @cricket you can integrate with LDAP/AD as explained here:
https://community.hortonworks.com/articles/105620/configuring-ranger-usersync-with-adldap-for-a-comm.html
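Note also that with unix sync, the usersync daemon only reads /etc/passwd and /etc/group on the host where it runs, so groups created on other nodes will not show up. If you move to LDAP/AD sync instead, the settings from the linked article boil down to something like the following in ranger-ugsync-site (a sketch: property names are from HDP 2.5, and the URL, bind DN and search bases are placeholders to replace):

```properties
# Switch usersync from unix to LDAP/AD (values are placeholder examples)
ranger.usersync.source.impl.class=org.apache.ranger.ldapusersync.process.LdapUserGroupBuilder
ranger.usersync.ldap.url=ldap://ad.example.com:389
ranger.usersync.ldap.binddn=cn=ranger,ou=service,dc=example,dc=com
ranger.usersync.ldap.user.searchbase=ou=users,dc=example,dc=com
ranger.usersync.group.searchenabled=true
ranger.usersync.group.searchbase=ou=groups,dc=example,dc=com
```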

Related

Can a Databricks cluster be shared across workspaces?

My ultimate goal is to differentiate/manage the costs on Databricks (Azure) based on different teams/projects.
And I was thinking whether I could utilize workspace to achieve this.
I read the passage below; it sounds like a workspace can access a cluster, but it does not say whether multiple workspaces can access the same cluster or not.
A Databricks workspace is an environment for accessing all of your Databricks assets. The workspace organizes objects (notebooks, libraries, and experiments) into folders, and provides access to data and computational resources such as clusters and jobs.
In other words, can I create a cluster and somehow ensure it can only be accessed by a certain project, team or workspace?
To manage who can access a particular cluster, you can make use of cluster access control. With cluster access control, you can determine what users can do on the cluster, e.g. attach to the cluster, restart it, or fully manage it. You can do this on a user level but also on a user-group level. Note that you have to be on the Azure Databricks Premium Plan to make use of cluster access control.
You also mentioned that your ultimate goal is to differentiate/manage costs on Azure Databricks. For this you can make use of tags. You can tag workspaces, clusters and pools which are then propagated to cost analysis reports in the Azure portal (see here).
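As a concrete sketch of the tagging approach, tags can be set in the cluster spec when creating a cluster through the Clusters REST API; the tag keys and values below are arbitrary examples, and the runtime/node-type strings are placeholders to adapt:

```json
{
  "cluster_name": "team-a-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "custom_tags": {
    "Team": "data-engineering",
    "Project": "churn-model"
  }
}
```

These custom tags are then propagated to the underlying Azure resources, so they show up in cost analysis reports and let you split spend by team or project.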

Do Ambari users and groups link back to local users / groups on the cluster nodes?

Confused about what Ambari users and groups are.
When looking at the docs and the Ambari UI (Admin/Users and Admin/Groups), I get the impression that users/groups created in this interface should appear across all nodes in the cluster, but this does not seem to be the case, e.g.:
[root@HW01 ~]# id <some user created in Ambari UI>
id: <some user created in Ambari UI>: no such user
Same situation for groups created in Ambari UI admin section.
I am not sure I understand the use of Ambari users and groups if they do not somehow link back to users and groups locally on the hosts. Can someone please explain what is going on here?
By default, users and groups are maintained in the Ambari database, not by the OS.
Ambari has to be separately configured to support PAM or LDAP login roles.
To configure these, you would use the ambari-server setup-pam or ambari-server setup-ldap commands.
I think newer versions of Ambari will create HDFS home directories for users, but that still doesn't mean those accounts are known to any datanode or namenode.
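A rough sketch of that setup flow on the Ambari server host (the setup commands are interactive, prompts and flags vary by Ambari version, and the LDAP details you enter are your own):

```shell
# On the Ambari server host (sketch; assumes an LDAP/AD server is available):
ambari-server setup-ldap    # interactive: LDAP URL, bind DN, user/group search bases
ambari-server restart

# After setup, pull users and groups from LDAP into Ambari's database:
ambari-server sync-ldap --all
```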

How should users work with an Ambari cluster?

My question is pretty trivial, but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase and HDFS (among other things).
I don't understand how a user who wants to use that cluster actually uses it.
For example, a user wants to copy a file to HDFS, run a spark-shell or create a new table in the HBase shell.
Should he get a local account on the server that runs the corresponding service? Or should he use a third-party machine (his own laptop, for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP like spark-shell has.
What is the normal/right/expected way to run all these tasks from a user's perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use (HDFS, Spark, HBase, et cetera).
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
Here you can see an example of an Ambari provisioning process step. I decided to install the clients on all servers.
Afterwards, one way to figure out which servers have the required clients installed is to check your hosts views in Ambari. Here you can find an example of an Ambari hosts view: check the green rectangle to see the installed clients.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, a client can use a service regardless of which server the service is actually running on.
Second, make sure that you are compliant with the security mechanisms of your cluster. For HDFS, this could influence which users you are allowed to use and which directories you can access with them. If you do not use security mechanisms such as Kerberos or Ranger, you should be able to run your stated tasks directly from the command line.
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
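To make it concrete, here is a sketch of the other tasks from the question as they would look on a host with the clients installed (file, table and column-family names are made-up examples):

```shell
hdfs dfs -put localfile.csv /tmp/   # copy a local file into HDFS
spark-shell --master yarn           # start a Spark shell that runs against the cluster
hbase shell                         # open the HBase shell; inside it, e.g.:
                                    #   create 'mytable', 'cf1'
```

Note that hadoop fs / hdfs dfs needs no server IP because the client reads the cluster location (fs.defaultFS in core-site.xml) from the configuration files Ambari deploys to client hosts.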
Take a look at Ambari Views, especially the Files view, which allows browsing HDFS.

Hadoop User Addition in Secured Cluster

We are using a kerberized CDH cluster. When adding a user to the cluster, we used to add the user only to the gateway/edge nodes, as in any Hadoop distro cluster. But with the newly added user IDs we are not able to execute map-reduce/YARN jobs; they throw a "user not found" exception.
When I researched through this, I came across a link https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/SecureContainer.html , which says to execute the yarn jobs in the secured cluster, we might need to have the corresponding user in all the nodes as the secure containers execute under the credentials of the job user.
So we added the corresponding userID to all the nodes and the jobs are getting executed.
If this is the case, and the cluster has around 100+ nodes, user provisioning for each user ID becomes a tedious job.
Can anyone please suggest any other effective way, if you came across the same scenario in your project implementation?
There are several approaches ordered by difficulty (from simple to painful).
One is to have a job-runner user that everyone uses to run jobs.
Another one is to use a configuration management tool such as Chef or Puppet to sync /etc/passwd and /etc/group across your cluster at regular intervals (1 hour to 1 day), or to use a cron job to do this.
Otherwise you can use commercial or open-source Linux/UNIX user-mapping services such as Centrify (commercial), VAS (commercial), FreeIPA (free) or SSSD (free).
If you have an Active Directory server or LDAP server use the Hadoop LDAP user mappings.
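If you script the /etc/passwd sync yourself, the core of it is picking out the non-system accounts to replicate. A minimal sketch follows; the UID cutoff of 1000 is a common convention (not a rule), and distributing the entries to each node is left to scp/pdsh or your cron job:

```shell
# Print the names of non-system accounts (UID >= 1000) in a passwd-format file.
# A cron job could then push these entries to every node (not shown here).
extract_users() {
  awk -F: '$3 >= 1000 && $1 != "nobody" {print $1}' "$1"
}

# Example with a throwaway sample file:
printf 'root:x:0:0::/root:/bin/bash\nalice:x:1001:1001::/home/alice:/bin/bash\n' > /tmp/passwd.sample
extract_users /tmp/passwd.sample    # prints: alice
```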
References:
https://community.hortonworks.com/questions/57394/what-are-the-best-practises-for-unix-user-mapping.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cm_sg_ldap_grp_mappings.html

Installing cosmos-gui

Can you help me with installing cosmos-gui? I think you are one of the developers behind Cosmos, am I right?
We have already installed Cosmos, and now we want to install cosmos-gui.
In the link below, I found the install guide:
https://github.com/telefonicaid/fiware-cosmos/blob/develop/cosmos-gui/README.md#prerequisites
Under subchapter “Prerequisites” is written
A couple of sudoer users, one within the storage cluster and another one within the computing clusters, are required. Through these users, the cosmos-gui will remotely run certain administration commands such as new users creation, HDFS userspaces provision, etc. The access through these sudoer users will be authenticated by means of private keys.
What is meant by the above? Must I create a sudoer user for both the computing and storage clusters? And for that, do I need to install a MySQL DB?
And under subchapter “Installing the GUI.”
Before continuing, remember to add the RSA key fingerprints of the Namenodes accessed by the GUI. These fingerprints are automatically added to /home/cosmos-gui/.ssh/known_hosts if you try an ssh access to the Namenodes for the first time.
I can’t make any sense of the above. Can you give a step-by-step plan?
I hope you can help me.
JH
First of all, a reminder about the Cosmos architecture:
There is a storage cluster based on HDFS.
There is a computing cluster based on shared Hadoop or based on Sahara; that's up to the administrator.
There is a services node for the storage cluster, a special node not storing data but exposing storage-related services such as HttpFS for data I/O. It is the entry point to the storage cluster.
There is a services node for the computing cluster, a special node not involved in the computations but exposing computing-related services such as Hive or Oozie. It is the entry point to the computing cluster.
There is another machine hosting the GUI, not belonging to any cluster.
That being said, the paragraphs you mention try to explain the following:
Since the GUI needs to perform certain sudo operations on the storage and computing clusters for user-account creation purposes, a sudoer user must be created on both services nodes. These sudoer users will be used by the GUI to remotely perform the required operations on top of ssh.
Regarding the RSA fingerprints: since the operations the GUI performs on the services nodes are executed on top of ssh, the fingerprints the servers send back when you ssh into them must be included in the .ssh/known_hosts file. You may do this manually, or simply by ssh'ing into the services nodes for the first time (you will be prompted whether to add the fingerprints to the file).
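A common way to pre-seed the known_hosts file non-interactively is ssh-keyscan; the hostname below is a placeholder, and you should verify the collected fingerprints out-of-band before trusting them:

```shell
# On the GUI machine, as the cosmos-gui user ("namenode1" is a placeholder):
ssh-keyscan -t rsa namenode1 >> /home/cosmos-gui/.ssh/known_hosts

# Confirm the host now has a known_hosts entry:
ssh-keygen -F namenode1 -f /home/cosmos-gui/.ssh/known_hosts
```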
MySQL appears in the requirements because that section lists all the prerequisites in general; there is not necessarily a relation among them. In this particular case, MySQL is needed to store the account information.
We are always improving the documentation; we'll try to explain this better in the next release.
