Who accessed a Hive table or HDFS directory - hadoop

Is there a way to figure out which user ran a ‘select’ query against a Hive table, and at what time it was run?
More generally, which user accessed an HDFS directory?

HDFS has an audit log which will tell you which operations were run by which users. This is an old doc that shows how to enable audit logging, but it should still be relevant. For audit logging at the Hive level, though, you'll have to look at some cutting-edge tech.
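As a rough sketch (file paths and property names can differ between distributions), the NameNode audit log is switched on through the logger settings and can then be searched for reads of a table's directory:

    # In hadoop-env.sh, point the audit logger at a real appender instead of NullAppender
    # (RFAAUDIT is the rolling-file appender defined in the stock log4j.properties):
    export HADOOP_NAMENODE_OPTS="-Dhdfs.audit.logger=INFO,RFAAUDIT ${HADOOP_NAMENODE_OPTS}"

    # After a NameNode restart, each namespace operation is logged with a timestamp,
    # allowed/denied, the user (ugi), client ip, cmd and src path. Reads of a Hive
    # table's files show up as cmd=open under the warehouse path, e.g.:
    grep 'cmd=open' /var/log/hadoop-hdfs/hdfs-audit.log | grep 'src=/user/hive/warehouse/mytable'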
Hortonworks acquired XASecure to implement security features on top of their platform. Cloudera acquired Gazzang to do the same thing. Both provide some level of audit logging (and authorization) for other services like Hive and HBase. They're also adding a lot more security-related features, but I'm not sure of the roadmap.

Related

What does it mean 'limited to Hive table data' in Apache Sentry reference?

Here https://www.cloudera.com/documentation/enterprise/5-9-x/topics/sentry_intro.html
we can read that
Apache Sentry Overview

Apache Sentry is a granular, role-based authorization module for Hadoop. Sentry provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications on a Hadoop cluster. Sentry currently works out of the box with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala, and HDFS (limited to Hive table data).
What does it mean exactly that HDFS is limited to Hive table data?
Does it mean that I can't set access privileges for users on particular HDFS paths?
For example,
I would like to set read access for user_A to path /my_test1
and write/read access for user_B to path /my_test1 and path /my_test2.
Is it possible with Apache Sentry?
Sentry controls do not replace HDFS ACLs. The synchronization between Sentry permissions and HDFS ACLs is one-way: the Sentry plugin on the NameNode applies Sentry permissions along with the HDFS ACLs, so that HDFS enforces access to Hive table data according to Sentry's configuration, even when the data is accessed with other tools. In that case, HDFS access control is simply a means of enforcing the policies defined in Sentry.
Enforcement of arbitrary file access in HDFS should still be done via HDFS ACLs.
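For the concrete example in the question, plain HDFS ACLs (independent of Sentry) would look roughly like this; a minimal sketch assuming ACLs are enabled on the NameNode (dfs.namenode.acls.enabled=true) and that the users and paths already exist:

    # Read (and traverse) access for user_A on /my_test1:
    hdfs dfs -setfacl -R -m user:user_A:r-x /my_test1

    # Read/write access for user_B on both paths:
    hdfs dfs -setfacl -R -m user:user_B:rwx /my_test1
    hdfs dfs -setfacl -R -m user:user_B:rwx /my_test2

    # Verify the resulting ACLs:
    hdfs dfs -getfacl /my_test1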

Newbie: Hadoop IIS Logs - Reasonable approach?

I am a total beginner with Hadoop - so sorry if this is a stupid question.
My fictional scenario is that I have several web servers (IIS) with several log locations. I want to centralize these log files and, based on the data, analyze the health of the applications and the web servers.
Since the Hadoop ecosystem offers a variety of tools, I am not sure if my solution is a valid one.
So I thought that I would move the log files to HDFS, create an external table on the directory as well as an internal table, and copy the data via Hive (insert into ... select from) from the external table to the internal table (with some filtering because of the comment lines beginning with #).
When the data is stored in the internal table, I delete the previously moved files from HDFS.
Technically it works, I already tried it - but is this a reasonable approach?
And if yes - how would I automate these steps, since so far I did everything manually via Ambari?
Thanks for your input
BW
Yes, this is a perfectly fine approach.
Outside of setting up the Hive table ahead of time, what's left to automate?
You want to run things on a schedule? Use Oozie, Luigi, Airflow, or Azkaban.
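As a rough illustration, the manual steps from the question could be collapsed into one script that the scheduler (or an Oozie shell action) runs periodically. This is only a sketch - the table names, paths and the single-column raw table are assumptions, not taken from the question:

    #!/bin/bash
    set -e

    RAW_DIR=/data/iis/raw   # directory the external table points at (example path)

    # 1. Put newly collected log files into the external table's directory
    #    (assumes the IIS logs have already been copied to this edge node).
    hdfs dfs -put /tmp/iis/*.log "$RAW_DIR"/

    # 2. Copy into the managed table, skipping IIS comment lines that start with '#'.
    #    Assumes the external table exposes each raw line as a single string column.
    hive -e "
      INSERT INTO TABLE iis_logs
      SELECT * FROM iis_logs_raw
      WHERE log_line NOT LIKE '#%';
    "

    # 3. Remove the raw files once they have been loaded.
    hdfs dfs -rm -skipTrash "$RAW_DIR"/*.log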
Ingesting logs from other Windows servers because you have a highly available web service? Use Puppet, for example, to configure your log collection agents (not Hadoop-related).
Note: if it's only log file collection that you care about, I would probably use Elasticsearch instead of Hadoop to store the data, Filebeat to continuously watch the log files, Logstash to apply per-message filtering, and Kibana to do visualizations. If you combine Elasticsearch for fast indexing/searching with Hadoop for archival, you can insert Kafka between the log message ingestion and the message writers/consumers.

Authorizing Hadoop users without Sentry

I have a Kerberized CDH cluster, where there are some daily oozie workflows running. All of them use shell, impala-shell, hive and sqoop to ingest data to Hive tables (lets call these tables SensitiveTables)
Now, I want to create 2 new BI users to use the cluster and experiment with some other ingested data.
The requirement is that these new BI users:
should not have access to the SensitiveTables
should be able to spark-submit jobs to the cluster
(optionally) use Hue
Apart from setting up Apache Sentry (which is the recommended way to go), is there any chance to meet those requirements using file permissions or ACLs and Service Level Authorization?
So far, I managed (via hadoop fs -chmod o-rwx /user/hive/warehouse/sensitive) to restrict access to SensitiveTables via Hive (which uses user impersonation), but failed to do so via Impala (which submits all jobs to the cluster as the impala user). Is there anything else I should try?
Thank you,
Gee
After a lot of research, and based on the assumptions I described, the answer is NO. Furthermore, the metastore cannot be protected this way.
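For reference, the file-permission approach from the question looks roughly like this, and the final comment is the reason it falls short (the etl_user account name is just an example):

    # Remove world access to the sensitive warehouse directory:
    hadoop fs -chmod -R o-rwx /user/hive/warehouse/sensitive

    # Explicitly re-grant the service account(s) the Oozie workflows run as,
    # e.g. via an HDFS ACL (requires dfs.namenode.acls.enabled=true):
    hdfs dfs -setfacl -R -m user:etl_user:rwx /user/hive/warehouse/sensitive

    # The catch: Impala reads HDFS as the single 'impala' user for every query,
    # so there is no per-end-user identity left to restrict at the file level.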

What to use: Impala on HDFS, Impala on HBase, or just HBase?

I am working on Proof of Concept task.
The task is to implement a feature of our product using Hadoop technology.
The feature is quite simple: we have a UI which lets you enter details about a "Network Issue".
All details about such an issue are captured and inserted into a table in an Oracle DB.
We then process the data in this table and calculate a Health Score.
I have to use Hadoop instead of a traditional DB, so my question is: what should I go for?
Impala on HDFS? or
Impala on HBase? or
HBase?
I am using a Cloudera VM for the POC implementation.
As per my understanding, HBase is a distributed NoSQL database, which is actually a layer on top of HDFS and provides Java APIs to access the data.
Impala is a tool which also provides JDBC access to the data, either over HBase or directly over HDFS.
I am very new to Hadoop, can someone please help?
Well, it depends on several things, like the kind of processing you are going to perform, the desired response time, etc. But from what you have written here, HBase seems to be fine. I don't see any need for Impala as of now. The HBase API is good and will serve most of your needs.
IMHO, it's better to keep things simple initially and add a tool only if it is really required. The same holds true here. If you reach a point where you find that the HBase API is not able to serve the purpose, you could definitely add Impala to your stack.
That being said, there is one thing you should keep in mind: HBase is a NoSQL DB and doesn't follow RDBMS conventions and terminology, so you might find it a bit strange initially. Keep this in mind as you proceed, since you'll have to design the schema in a way that is quite different from the RDBMS style of schema design.
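To get a feel for it, here is a minimal sketch in the HBase shell; the table name, column family and columns are made-up examples for the "Network Issue" data, not a recommended schema:

    hbase shell <<'EOF'
    create 'network_issues', 'info'
    put 'network_issues', 'issue-0001', 'info:severity', 'HIGH'
    put 'network_issues', 'issue-0001', 'info:reported_at', '2016-01-01T10:00:00'
    get 'network_issues', 'issue-0001'
    scan 'network_issues'
    EOF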

How do I add a new user to Hue/Hive with permissions?

I have a Hadoop cluster that I manage via the Hue interface to run Hive queries.
I would like to add another user to Hue and give them access to SOME of the tables to run queries on. Is this possible?
Hue is just a view on top of Hive so using Hive Authorization should do it (beware: Hive Authorization is currently being reworked in order to be really secure).
You might also want to add that user to the hadoop group that you must have created to run Hadoop, and to create a separate working directory for each user on HDFS.
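A rough sketch of those steps, assuming the legacy (default) Hive authorization mode; the user, group and table names are placeholders:

    # Add the OS user to the hadoop group (the group name depends on your setup):
    sudo usermod -a -G hadoop new_analyst

    # Give the user a working directory on HDFS:
    hadoop fs -mkdir -p /user/new_analyst
    hadoop fs -chown new_analyst /user/new_analyst

    # Grant SELECT on only the tables they should be able to query
    # (legacy Hive authorization syntax):
    hive -e "GRANT SELECT ON TABLE some_table TO USER new_analyst;"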
