Hive User Impersonation for Sentry - hadoop

I have been reading that while using Sentry you must disable Hive user impersonation.
Is it necessary to disable impersonation? If yes, is there any other way to impersonate the Hive user with Sentry enabled?

Impersonation and Sentry are two different ways to provide authorization in Hive. The first is based on "POSIX-like" HDFS file system permissions, while Sentry is a role-based authorization module plus the Sentry service.
There is no way to use Sentry with impersonation enabled in Hive; it would be a security issue. A user or application granted access to any entity (database, table) stored in the Hive metastore could gain access to directories and files on HDFS that do not "belong" to it.
According to Cloudera, impersonation is not a recommended way to implement authorization in HiveServer2 (see HiveServer2 Impersonation).
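In practice, disabling impersonation means turning off HiveServer2's doAs behaviour; a minimal hive-site.xml sketch of the setting Sentry expects (set it through Cloudera Manager or directly in the file):

<!-- hive-site.xml: run queries as the hive service user rather than the connected end user -->
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>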

Related

What does it mean 'limited to Hive table data' in Apache Sentry reference?

Here https://www.cloudera.com/documentation/enterprise/5-9-x/topics/sentry_intro.html
we can read that
Apache Sentry Overview: "Apache Sentry is a granular, role-based authorization module for Hadoop. Sentry provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications on a Hadoop cluster. Sentry currently works out of the box with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala, and HDFS (limited to Hive table data)."
What does it mean exactly that HDFS is limited to Hive table data?
Does it mean that I can't set privileges access for users to particular paths on HDFS?
For example,
I would like to set read access for user_A to path /my_test1
and write/read access for user_B to path /my_test1 and path /my_test2.
Is it possible with Apache Sentry?
Sentry controls do not replace HDFS ACLs. The synchronization between Sentry permissions and HDFS ACLs is one-way; the Sentry plugin on the NameNode applies Sentry permissions alongside HDFS ACLs, so that HDFS enforces access to Hive table data according to Sentry's configuration even when the data is accessed with other tools. In that case, HDFS access control is simply a means of enforcing the policies defined in Sentry.
Enforcement of arbitrary file access in HDFS should still be done via HDFS ACLs.
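For the example in the question, that means granting access on those paths with HDFS ACLs directly rather than through Sentry; a minimal sketch using the standard setfacl command (ACLs must be enabled via dfs.namenode.acls.enabled; the paths and users are the ones from the question):

# read (and traverse) access for user_A on /my_test1
hdfs dfs -setfacl -m user:user_A:r-x /my_test1
# read/write access for user_B on both paths
hdfs dfs -setfacl -m user:user_B:rwx /my_test1
hdfs dfs -setfacl -m user:user_B:rwx /my_test2
# check the resulting ACLs
hdfs dfs -getfacl /my_test1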

Restrict Folder Access in Hadoop

Two different groups of people plan to use our hadoop cluster, but I don't want them to see each other's data.
How can I prevent this on the Hadoop cluster?
I understand that if you set an environment variable you can easily impersonate the Hadoop superuser and access all data in HDFS. Is there a simpler way to prevent this, or is Kerberos- and LDAP-based security the only way to go?
Kerberos is the only way to prevent users in Hadoop from impersonating the hdfs superuser and misusing its privileges.
It is very simple for users to impersonate the hdfs user (which happens to be the Hadoop superuser in most distributions). Anyone can do so by setting the environment variable HADOOP_USER_NAME to hdfs.
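To illustrate how simple this is on a cluster with plain (simple) authentication; the path below is only an example:

# with simple authentication, the client-side environment variable is trusted as-is
export HADOOP_USER_NAME=hdfs
hdfs dfs -ls /user/other_group            # read another group's data as the superuser
hdfs dfs -chmod -R 700 /user/other_group  # or even change its permissions
# with Kerberos enabled this variable is ignored; the identity comes from the Kerberos ticket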

Is it possible in HDFS to partially encrypt a table? (some columns only)

I cannot find any pertinent source.
I am working in Cloudera CDH 5.3
Any help is appreciated.
If the tables are in Hive, then Cloudera has Sentry; refer to its documentation.
As of now, Sentry does not support column-level security; it can only restrict users/groups from accessing or reading the content of a particular table.
In the case of HBase tables, the underlying HDFS files can be restricted by changing the access or owner privileges. This can also be done for Hive tables.
==Update==
Currently, column-level data encryption is not supported; there are a few JIRA issues regarding this.
As a workaround I would suggest the following (a usage sketch follows the list):
Develop a UDF for encryption and another for decryption, using some algorithm.
Use the encryption UDF during data insertion; this encrypts the data and stores it encrypted in HDFS.
Use the decryption UDF whenever the data is read.
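A sketch of how that workaround could look in HiveQL, assuming you have packaged two hypothetical UDFs (encrypt_col and decrypt_col are made-up names, as are the jar path, classes, tables and column):

-- register the hypothetical encryption/decryption UDFs from your own jar
ADD JAR /path/to/crypto-udfs.jar;
CREATE TEMPORARY FUNCTION encrypt_col AS 'com.example.udf.EncryptUDF';
CREATE TEMPORARY FUNCTION decrypt_col AS 'com.example.udf.DecryptUDF';

-- encrypt the sensitive column while inserting, so HDFS only ever stores ciphertext
INSERT INTO TABLE customers_enc
SELECT id, name, encrypt_col(ssn) FROM customers_staging;

-- decrypt on read, for users who are allowed to use the UDF / hold the key
SELECT id, name, decrypt_col(ssn) FROM customers_enc;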
Hope this helps.
You should have a look at Apache Accumulo; it has cell-level security, and I believe it is an installable service in Cloudera Manager:
http://accumulo.apache.org/1.4/user_manual/Security.html
Each individual datum can be given a security label.

Flume-ng hdfs security

I'm new to Hadoop and Flume NG and I need some help.
I don't understand how HDFS security is implemented.
Here are some configuration lines from the Flume User Guide:
# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata
Does it mean that anyone who knows my hdfs path can write any data to my hdfs?
The question is from some time ago, but I'll try to answer it for any other developer dealing with Flume and HDFS security.
Flume's HDFS sink just needs the endpoint where the data is going to be persisted. Whether such an endpoint is secured or not depends entirely on Hadoop, not on Flume.
The Hadoop ecosystem has several tools and systems for implementing security, but focusing on the native elements, we talk about authentication and authorization.
Authentication is based on Kerberos, and as with any other auth mechanism, it is the process of determining whether someone or something is, in fact, who or what it claims to be. So, with authentication it is not enough to know an HDFS user name; you have to demonstrate that you own that user by first authenticating against Kerberos and obtaining a ticket. Authentication may be password-based or keytab-based; you can think of keytabs as "certificate files" containing the authentication keys.
Authorization can be implemented at the file system level, by deciding which permissions each folder or file within HDFS has. Thus, if a certain file has 600 permissions, only its owner will be able to read or write it. Other authorization mechanisms such as HDFS ACLs can also be used.
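A small command-line sketch of both mechanisms (the principal, keytab and file path below are placeholders):

# authentication: obtain a Kerberos ticket using a keytab
kinit -kt /etc/security/keytabs/flume.keytab flume/host.example.com@EXAMPLE.COM

# authorization: restrict a file so that only its owner can read or write it
hdfs dfs -chmod 600 /flume/webdata/events.log
hdfs dfs -ls /flume/webdata/events.log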
That being said, if you have a look at the Flume HDFS sink, you'll see there are a couple of parameters related to Kerberos:
hdfs.kerberosPrincipal – Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab – Kerberos keytab for accessing secure HDFS
In Kerberos terminology, a principal is a unique identity to which Kerberos can assign tickets. Thus, for each enabled user at HDFS you will need a principal registered in Kerberos. The keytab, as previously said, is a container for the authentication keys a certain principal owns.
Thus, if you want to secure your HDFS then install Kerberos, create principals and keytabs for each enabled user and configure the HDFS sink properly. In addition, change the permissions appropriately in your HDFS.
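Putting it together, here is a sketch of the sink configuration from the question extended with those two Kerberos parameters (the principal and keytab path are placeholders):

# properties of hdfs-Cluster1-sink, extended for a Kerberized HDFS
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.kerberosPrincipal = flume/host.example.com@EXAMPLE.COM
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.kerberosKeytab = /etc/security/keytabs/flume.keytab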

Regarding Hadoop Security via Kerberos

I am trying to learn how Kerberos can be implemented in Hadoop.
I have gone through this doc https://issues.apache.org/jira/browse/HADOOP-4487
I have also gone through Basic Kerberos stuff (https://www.youtube.com/watch?v=KD2Q-2ToloE)
1) The Apache doc uses the word "Token", whereas the general documentation on the internet uses the term "Ticket".
Are Token and Ticket the same?
2) The Apache doc also says: "DataNodes do not enforce any access control on accesses to its data blocks. This makes it possible for an unauthorized client to read a data block as long as she can supply its block ID. It's also possible for anyone to write arbitrary data blocks to DataNodes."
My thoughts on this:
I can fetch the block ID from a file path using the following command:
hadoop#Studio-1555:/opt/hadoop/hadoop-1.0.2/bin$ ./hadoop fsck /hadoop/mapred/system/jobtracker.info -files -blocks
FSCK started by hadoop from /127.0.0.1 for path /hadoop/mapred/system/jobtracker.info at Mon Jul 09 06:57:14 EDT 2012
/hadoop/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK
0. blk_-9148080207111019586_1001 len=4 repl=1
As I was authorized to access this file jobtracker.info, I was able to find its blockID using the above command.
I think that if I add some offset to this block ID, I could write directly to that DataNode.
How can I explicitly specify the block ID while writing a file to HDFS? (What is the command?)
Is there any other way to write arbitrary data blocks to DataNodes?
Please tell me if my approach is wrong.
Are Token and Ticket the same?
No. Tickets are issued by Kerberos and then servers in Hadoop (NameNode or JobTracker) issue tokens to provide authentication within the Hadoop cluster. Hadoop does not rely on Kerberos to authenticate running tasks, for instance, but uses its own tokens that were issued based on the Kerberos tickets.
The Apache doc also says, "DataNodes do not enforce any access control on accesses to its data blocks."
I'm guessing you're taking that from the JIRA where access control was provided (https://issues.apache.org/jira/browse/HADOOP-4359) via BlockAccessTokens. Assuming this is turned on - which it should be in a secure cluster - one cannot access a block on a datanode without such a token, which is issued by the NameNode after authentication and authorization via Kerberos and HDFS' own file system permissions.
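For reference, that enforcement is controlled by a single HDFS property; a minimal hdfs-site.xml sketch (it is required on Kerberos-secured clusters):

<!-- hdfs-site.xml: make DataNodes require a BlockAccessToken for block reads/writes -->
<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>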
How can I access the DataNode and write data arbitrarily?
I am not sure what you mean here. Do you mean when the user does not have permission? As Jacob mentioned,
you will not get a valid BlockAccessToken unless the user has permission to access the data based on the file system permissions, assuming that you have a secure Hadoop cluster.
