How to run Hive on Google Cloud Dataproc from within the machine?

I've just created a Google Cloud Dataproc cluster. A few basic things are not working for me:
I'm trying to run the Hive console from the master node, but it fails to load with any user other than root (it looks like there's a lock; the console just hangs).
But even when using root, I see some odd behaviour:
"show tables;" shows a table named "input"
querying that table raises an exception saying the table is not found.
It is not clear which user creates the tables through the web UI. I create a job and execute it, but then don't see the results through the console.
I couldn't find any good documentation on this - does anybody have an idea?

Running the hive command at present is somewhat broken due to the default metastore configuration.
I recommend you use the beeline client instead, which talks to the same Hive Server 2 as Dataproc Hive Jobs. You can use it via ssh by running beeline -u jdbc:hive2://localhost:10000 on the master.
YARN applications are submitted by Hive Server 2 as the user "nobody"; you can specify a different user by passing the -n flag to beeline, but it shouldn't matter with default permissions.
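For example, a typical session might look like this (the cluster name, zone, and user name below are placeholders; Dataproc master nodes are conventionally named <cluster-name>-m):

  gcloud compute ssh my-cluster-m --zone=us-central1-a
  # then, on the master node:
  beeline -u jdbc:hive2://localhost:10000
  # or, to have YARN applications submitted as a specific user instead of "nobody":
  beeline -u jdbc:hive2://localhost:10000 -n myuser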

This thread is a bit old, but this result still comes up when searching for Google Cloud Platform and Hive, so I'm adding some info which may be useful.
Currently, in order to submit a job to Google Dataproc, I think - like for most other products - there are three options:
from the UI
from the console, using a command line like the following (a concrete example follows the list):
gcloud dataproc jobs submit hive --cluster=CLUSTER (--execute=QUERY, -e QUERY | --file=FILE, -f FILE) [--async] [--bucket=BUCKET] [--continue-on-failure] [--jars=[JAR,…]] [--labels=[KEY=VALUE,…]] [--params=[PARAM=VALUE,…]] [--properties=[PROPERTY=VALUE,…]] [GLOBAL-FLAG …]
REST API call like: https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit
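For instance, a minimal submission from the command line, assuming a cluster named my-cluster and a query file in a Cloud Storage bucket (both placeholders), might look like:

  gcloud dataproc jobs submit hive --cluster=my-cluster --execute="SHOW TABLES;"
  # or run a HiveQL script stored in Cloud Storage:
  gcloud dataproc jobs submit hive --cluster=my-cluster --file=gs://my-bucket/queries/report.hql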
Hope this will be useful to someone.

Related

As a Hadoop Regular User, Is There a Way to See Details about Running Jobs?

I do not have access to any CLI on any of the Hadoop nodes, but I have access to the cluster via Hue and Jupyter. The engineering team has also configured the Hadoop UI that shows New, Running, Submitted, Finished, etc. applications. However, it appears all jobs have a generic name, for instance, something like this:
HIVE-f23fa1a1-4444-4ab2-1c44-12345a123456
or similar, and when I click on the application_id, I get a "Failed to read the attempts of the application" error (even for my own jobs). Similarly, Spark jobs, which you can normally name using setAppName, are all given generic "Spark-something" names because the Spark context is already initialized when Jupyter is brought up on an edge node (i.e. I can't set a name because one already exists).
Is there a way for an unprivileged Hadoop user to see what job is actually running (i.e. the Hive query or the Spark/Hadoop command), without having some sort of CLI privilege?
I have tried using a few urls that I suspect have job information in them, for instance:
http://cluster_master:<portnum>/history/application_1234123412341234_12345/jobs/ or
http://cluster_master:<portnum>/jobs/application_1234123412341234_12345/
but neither attempt returns any details about the job itself (not even things I named myself within the Hive/Spark context using setAppName).
Please let me know if there's a better way to ask this question. I am relatively new to Hadoop/Spark. All the reference docs and SO answers I've found assume CLI or privileged access and I can't find any documentation in either Spark or Hadoop that applies to this problem.

Authorizing Hadoop users without Sentry

I have a Kerberized CDH cluster, where some daily Oozie workflows are running. All of them use shell, impala-shell, hive and sqoop to ingest data into Hive tables (let's call these tables SensitiveTables).
Now, I want to create 2 new BI users to use the cluster and experiment with some other ingested data.
The requirement is that these new BI users:
should not have access to the SensitiveTables
should be able to spark-submit jobs to the cluster
(optionally) use Hue
Apart from setting up Apache Sentry (which is the recommended way to go), is there any chance of meeting those requirements using file permissions or ACLs and Service Level Authorization?
So far, I have managed (via hadoop fs -chmod o-rwx /user/hive/warehouse/sensitive) to restrict access to SensitiveTables via Hive (which uses user impersonation), but failed to do so via Impala (which submits all jobs to the cluster as the user impala). Is there anything else I should try?
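For reference, the kind of file-level restriction attempted looks roughly like this (the BI user name in the ACL line is a placeholder, and the setfacl variant assumes HDFS ACLs are enabled via dfs.namenode.acls.enabled=true):

  # strip "other" access from the table's warehouse directory
  hadoop fs -chmod o-rwx /user/hive/warehouse/sensitive
  # or deny a specific user explicitly with an HDFS ACL
  hdfs dfs -setfacl -R -m user:bi_user1:--- /user/hive/warehouse/sensitive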
Thank you,
Gee
After a lot of research, and based on the assumptions I described, the answer is NO. Furthermore, the metastore cannot be protected this way.

kerberos ticket and delegation token usage in Hadoop Oozie shell action

I am new to Hadoop and trying to understand why my Oozie shell action is not picking up the new ticket even after doing kinit. Here is my scenario.
I log in using my ID "A" and have a Kerberos ticket for my ID. I submit the Oozie workflow with the shell action using my ID.
Inside the Oozie shell action I do another kinit to obtain a ticket for ID "B".
Only this ID "B" has access to some HDFS file. kinit is working fine, since klist shows the ticket for ID "B". Now, when I read the HDFS file that only B has access to, I get a permission denied error saying "A" does not have permission to access this file.
But when I do the same thing from the Linux CLI, outside Oozie, after I kinit and take the ticket for "B", I am able to read the HDFS file as "B".
But the same steps do not work inside the Oozie shell action, and hadoop fs commands always seem to run as the user that submitted the Oozie workflow rather than the user for which a Kerberos ticket is present.
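Roughly, the steps inside the shell action look like this (the keytab path, principal, and file path are placeholders):

  # inside the Oozie shell action, which runs as the workflow submitter "A"
  kinit -kt /path/to/B.keytab B@EXAMPLE.COM
  klist                                           # shows a valid ticket for B
  hadoop fs -cat /data/only_b_can_read/part-00000 # still fails: permission denied for A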
Can someone please explain why this is happening? I am not able to understand this.
In the same shell action, though the hadoop fs command failed to switch to user "B", the hbase shell works as user B. Just for testing, I created an HBase table that only "A" has access to, and added an hbase shell get command on this table to the action. If I do kinit -kt for user "B" and get its ticket, this fails too, saying "B" does not have access to this table. So I think HBase is taking the new ticket instead of the delegation token of the user who submitted the Oozie workflow. When I don't do kinit -kt inside the shell action, the hbase command succeeds.
If I do kinit, I cannot even run Hive queries: they fail saying "A" does not have execute access to some directories like /tmp/B/ that only "B" has access to. So I cannot tell how Hive is working - whether it takes the delegation token that is created when the Oozie workflow is submitted, or the new ticket created for the new user.
Can someone please help me understand the above scenario? Which Hadoop services take a new ticket for authentication, and which commands take the delegation token (like hadoop fs commands)? Is this how it is supposed to work, or am I doing something wrong?
I just don't understand why the same hadoop fs command worked outside Oozie as a different user but does not work inside the Oozie shell action, even after kinit.
When does this delegation token actually get created? Is it created only when the Oozie workflow is submitted, or also when I issue hadoop fs commands?
Thank you!
In theory -- Oozie automatically transfers the credentials of the submitter (i.e. A) to the YARN containers running the job. You don't have to care about kinit, because, in fact, it's too late for that.
You are not supposed to impersonate another user inside an Oozie job; that would defeat the purpose of strict Kerberos authentication.
In practice it's more tricky: (1) the core Hadoop services (HDFS, YARN) check the Kerberos ticket just once, then they create a "delegation token" that is shared among all nodes and all services.
(2) the oozie service user has special privileges: it can do a kind of Hadoop "sudo" so that it connects to YARN as oozie, but YARN creates the "delegation token" for the job submitter (i.e. A), and that's it, you can't alter that token.
(3) well, actually you can use an alternate token, but only with some custom Java code that explicitly creates a UserGroupInformation object for an alternate user. The Hadoop command-line interfaces don't do that.
(4) what about non-core Hadoop, e.g. HBase or the Hive Metastore, or non-Hadoop stuff like ZooKeeper? They don't use "delegation tokens" at all. Either you manage the UserGroupInformation explicitly in Java code, or the default Kerberos ticket is used at connect time.
That's why your HBase shell worked, and if you had used Beeline (the JDBC thin client) instead of Hive (the legacy fat client) it would probably have worked too.
(5) Oozie tries to fill that gap with specific <credentials> options for Hive, Beeline ("Hive2" action), HBase, etc.; I'm not sure how that works, but it has to imply a non-default Kerberos ticket cache, local to your job containers.
We've found it possible to become another Kerberos principal once an Oozie workflow is launched. We run a shell action that then runs java with a custom -Djava.security.auth.login.config=custom_jaas.conf, which provides a JVM kinit'ed as someone else. This is along the lines of Samson's (3), although this kinit can even be for a completely different realm.
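A rough sketch of that approach (the login-context name, principal, keytab path, and main class below are placeholders):

  custom_jaas.conf:
    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      keyTab="/path/to/other_user.keytab"
      principal="other_user@OTHER.REALM"
      storeKey=true
      useTicketCache=false;
    };

  launched from the shell action:
    java -Djava.security.auth.login.config=custom_jaas.conf -cp myapp.jar com.example.AlternateUserHdfsClient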

No passwd entry for user 'hdfs'

I'm trying to set up a Hive environment on my Google Compute Engine Hadoop cluster, which was deployed from the one-click deployment.
When I try to switch to the hdfs user (su hdfs), I get the error message below.
No passwd entry for user 'hdfs'
The "one-click deployment" is an older sample which perhaps showcases installation from shell scripts and tarballs, but isn't intended for use as a supported Hadoop service, and doesn't set up typical Hadoop installation configurations like an hdfs user or adding commands to /usr/bin.
If you want a more Hadoop (and Pig+Hive+Spark) specialized service, you may want to consider using Google Cloud Dataproc, which is Google's managed Hadoop solution. You can create clusters from the cloud console UI in Dataproc just like click-to-deploy, and you'll get a more fully installed Hadoop/Hive environment, including a per-cluster persistent MySQL-based Hive metastore which is shared with SparkSQL to make it easy to play with Spark without modifying your Hive environment if you so choose.
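For example, a cluster can also be created and reached from the command line (the cluster name, region, and zone are placeholders):

  gcloud dataproc clusters create my-cluster --region=us-central1
  # SSH to the master; hive, beeline and spark-shell are already on the PATH there:
  gcloud compute ssh my-cluster-m --zone=us-central1-a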

Why do queries submitted via the Hive CLI not show up in the ResourceManager, but those via the Hue Beeswax interface do?

I've got a Cloudera Hadoop installation (CDH4) which runs the Yarn framework, and I've got Hue installed as well.
I've noticed that when I submit a Hive query via the Hue (Beeswax) interface, the resulting MapReduce job shows up in the ResourceManager web UI as well as in Hue's 'Job Browser'. However, if I run the hive CLI on any of the nodes and execute the same query from there, it doesn't appear to hit any of the NodeManagers, although it does return the correct results.
The only difference I can think of is that the Hue job runs as the user I'm logged into Hue as, whereas the hive cli job runs as the user that started the hive cli, which is a different user.
I would expect queries submitted via the hive CLI to show up in the resource manager. Is there any reason why they are not?
Are you logged in as the same user in Hue? Hue's Job Browser filter shows only your own jobs by default. You can reset the username filter and check whether there are other jobs.
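Since you can already run the hive CLI on the nodes, you can also cross-check what the ResourceManager itself has accepted, independent of any UI filter (a sketch; exact output varies by version):

  yarn application -list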
