Permanent Kerberos tickets for interactive users of a Hadoop cluster

I have a Hadoop cluster which uses the company's Active Directory as its Kerberos realm. The nodes and the end-user Linux workstations are all Ubuntu 16.04. They are joined to the same domain using PowerBroker PBIS, so SSH logons between the workstations and the grid nodes are single sign-on. End-users run long-running scripts from their workstations which repeatedly use SSH to first launch Spark / Yarn jobs on the cluster and then keep track of their progress. These scripts have to keep running overnight and on weekends, well beyond the 10-hour lifetime of a Kerberos ticket.
I'm looking for a way to install permanent, service-style, Kerberos keytabs for the users, relieving them of the need to deal with kinit. I understand this would imply anyone with shell access to the grid as a particular user would be able to authenticate as that user.
I've also noticed that performing a non-SSO SSH login with a password automatically creates a new ticket valid from the time of the login. If this behaviour could be enabled for SSO logins, that would solve my problem.

You just have to ask users to add the --principal and --keytab arguments to their Spark jobs. Spark (actually the YARN code) will then renew tickets for you automatically. We have jobs that run for weeks using this approach.
See for example https://spark.apache.org/docs/latest/security.html#yarn-mode
For long-running apps like Spark Streaming apps to be able to write to
HDFS, it is possible to pass a principal and keytab to spark-submit
via the --principal and --keytab parameters respectively. The keytab
passed in will be copied over to the machine running the Application
Master via the Hadoop Distributed Cache (securely - if YARN is
configured with SSL and HDFS encryption is enabled). The Kerberos
login will be periodically renewed using this principal and keytab and
the delegation tokens required for HDFS will be generated periodically
so the application can continue writing to HDFS.
You can see in Spark driver logs when Yarn renews a Kerberos ticket.

If you are accessing Hive/HBase or any other component that needs a Kerberos ticket, then make your Spark code re-login when the ticket expires. You have to update the ticket to use the keytab, rather than relying on a TGT already being present in the cache. This is done using the UserGroupInformation class from the Hadoop Security package. Add the snippet below to your Spark job for long-running work:
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.hadoop.security.UserGroupInformation.AuthenticationMethod

val configuration = new Configuration
configuration.addResource("/etc/hadoop/conf/hdfs-site.xml")
UserGroupInformation.setConfiguration(configuration)
UserGroupInformation.getCurrentUser.setAuthenticationMethod(AuthenticationMethod.KERBEROS)
UserGroupInformation.loginUserFromKeytabAndReturnUGI(
    "hadoop.kerberos.principal", "path of hadoop.kerberos.keytab file")
  .doAs(new PrivilegedExceptionAction[Unit]() {
    override def run(): Unit = {
      // hbase/hive connection
      // logic
    }
  })
Above we specify the name of our service principal and the path to the keytab file we generated. As long as that keytab is valid our program will use the desired service principal for all actions, regardless of whether or not the user running the program has already authenticated and received a TGT.
If there is no other component to access besides Spark, then you don't need the code above. Simply provide the keytab and principal in your spark-submit command:
spark-submit --master yarn-cluster --keytab "xxxxxx.keytab" --principal "svc-xxxx@xxxx.COM" xxxx.jar

I took the suggestion above to use the --keytab argument to specify a custom keytab on the grid node from which I submit to Spark. I create my own per-user keytab using the script below. It remains valid until the user changes their password.
Note that the script makes the simplifying assumption that the Kerberos realm is the same as the DNS domain and the LDAP directory where users are defined. This holds for my setup; use with care on yours. It also expects the users to be sudoers on that grid node. A more refined script might separate keytab generation and installation.
#!/usr/bin/python2.7
from __future__ import print_function

import os
import sys
import stat
import getpass
import subprocess
import collections
import socket
import tempfile


def runSudo(cmd, pw):
    try:
        subprocess.check_call("echo '{}' | sudo -S -p '' {}".format(pw, cmd), shell = True)
        return True
    except subprocess.CalledProcessError:
        return False


def testPassword(pw):
    subprocess.check_call("sudo -k", shell = True)
    if not runSudo("true", pw):
        print("Incorrect password for user {}".format(getpass.getuser()), file = sys.stderr)
        sys.exit(os.EX_NOINPUT)


class KeytabFile(object):

    def __init__(self, pw):
        self.userName = getpass.getuser()
        self.pw = pw
        self.targetPath = "/etc/security/keytabs/{}.headless.keytab".format(self.userName)
        self.tempFile = None

    KeytabEntry = collections.namedtuple("KeytabEntry", ("kvno", "principal", "encryption"))

    def LoadExistingKeytab(self):
        if not os.access(self.targetPath, os.R_OK):
            # Note: the assumption made here, that the Kerberos realm is the same as the
            # DNS domain, may not hold in other setups
            domainName = ".".join(socket.getfqdn().split(".")[1:])
            encryptions = ("aes128-cts-hmac-sha1-96", "arcfour-hmac", "aes256-cts-hmac-sha1-96")
            return [
                self.KeytabEntry(0, "@".join((self.userName, domainName)), encryption)
                for encryption in encryptions]

        def parseLine(keytabLine):
            tokens = keytabLine.strip().split(" ")
            return self.KeytabEntry(int(tokens[0]), tokens[1], tokens[2].strip("()"))

        cmd = "klist -ek {} | tail -n+4".format(self.targetPath)
        entryLines = subprocess.check_output(cmd, shell = True).splitlines()
        return map(parseLine, entryLines)

    class KtUtil(subprocess.Popen):

        def __init__(self):
            subprocess.Popen.__init__(self, "ktutil",
                stdin = subprocess.PIPE, stdout = subprocess.PIPE, stderr = subprocess.PIPE, shell = True)

        def SendLine(self, line, expectPrompt = True):
            self.stdin.write(bytes(line + "\n"))
            self.stdin.flush()
            if expectPrompt:
                self.stdout.readline()

        def Quit(self):
            self.SendLine("quit", False)
            rc = self.wait()
            if rc != 0:
                raise subprocess.CalledProcessError(rc, "ktutil")

    def InstallUpdatedKeytab(self):
        fd, tempKt = tempfile.mkstemp(suffix = ".keytab")
        os.close(fd)
        entries = self.LoadExistingKeytab()
        ktutil = self.KtUtil()
        for entry in entries:
            # Bump the key version number and re-derive the key from the current password
            cmd = "add_entry -password -p {} -k {} -e {}".format(
                entry.principal, entry.kvno + 1, entry.encryption)
            ktutil.SendLine(cmd)
            ktutil.SendLine(self.pw)
        # Remove the empty placeholder so ktutil writes a fresh keytab instead of appending
        os.unlink(tempKt)
        ktutil.SendLine("write_kt {}".format(tempKt))
        ktutil.Quit()
        if not runSudo("mv {} {}".format(tempKt, self.targetPath), self.pw):
            os.unlink(tempKt)
            print("Failed to install the keytab to {}.".format(self.targetPath), file = sys.stderr)
            sys.exit(os.EX_CANTCREAT)
        os.chmod(self.targetPath, stat.S_IRUSR)
        # TODO: Also change group to 'hadoop'


if __name__ == '__main__':

    def main():
        userPass = getpass.getpass("Please enter your password: ")
        testPassword(userPass)
        kt = KeytabFile(userPass)
        kt.InstallUpdatedKeytab()

    main()
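For completeness, here is a rough sketch of how the installed keytab can then drive a long-running submission from the grid node. The job name and spark-submit options are placeholders, not part of my setup; the actual ticket renewal is done by --principal/--keytab as described in the answer above, and the realm is derived with the same simplifying assumption as in the script.
#!/usr/bin/python2.7
import getpass
import socket
import subprocess

user = getpass.getuser()
# Same simplifying assumption as above: realm == upper-cased DNS domain
realm = ".".join(socket.getfqdn().split(".")[1:]).upper()
keytab = "/etc/security/keytabs/{}.headless.keytab".format(user)

subprocess.check_call([
    "spark-submit",
    "--master", "yarn", "--deploy-mode", "cluster",
    "--principal", "{}@{}".format(user, realm),
    "--keytab", keytab,
    "my-long-running-job.py",  # placeholder application
])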

Related

Specifying MSK credentials in an AWS CDK stack

I have code that seems to "almost" deploy. It will fail with the following error:
10:55:25 AM | CREATE_FAILED | AWS::Lambda::EventSourceMapping | QFDSKafkaEventSour...iltynotifyEFE73996
Resource handler returned message: "Invalid request provided: The secret provided in 'sourceAccessConfigurations' is not associated with cluster some-valid-an. Please provide a secret associated with the cluster. (Service: Lambda, Status Code: 400, Request ID: some-uuid )" (RequestToken: some-uuid, HandlerErrorCode: InvalidRequest)
I've cobbled together the cdk stack from multiple tutorials, trying to learn CDK. I've gotten it to the point that I can deploy a lambda, specify one (or more) layers for the lambda, and even specify any of several different sources for triggers. But our production Kafka requires credentials... and I can't figure out for the life of me how to supply those so that this will deploy correctly.
Obviously, those credentials shouldn't be included in the git repo of my codebase. I assume I will have to set up a Secrets Manager secret with part or all of the values. We're using scram-sha-512, and it includes a user/pass pair. The 'secret_name' value to Secret() is probably the name/path of the Secrets Manager secret. I have no idea what the second, unnamed param is for, and I'm having trouble figuring that out. Can anyone point me in the right direction?
Stack code follows:
#!/usr/bin/env python3
from aws_cdk import (
    aws_lambda as lambda_,
    App, Duration, Stack
)
from aws_cdk.aws_lambda_event_sources import ManagedKafkaEventSource
from aws_cdk.aws_secretsmanager import Secret


class ExternalRestEndpoint(Stack):
    def __init__(self, app: App, id: str) -> None:
        super().__init__(app, id)

        secret = Secret(self, "Secret", secret_name="integrations/msk/creds")
        msk_arn = "some valid and confirmed arn"

        # Lambda layer.
        lambdaLayer = lambda_.LayerVersion(self, 'lambda-layer',
            code = lambda_.AssetCode('utils/lambda-deployment-packages/lambda-layer.zip'),
            compatible_runtimes = [lambda_.Runtime.PYTHON_3_7],
        )

        # Source for the lambda.
        with open("src/path/to/sourcefile.py", encoding="utf8") as fp:
            mysource_code = fp.read()

        # Config for it.
        lambdaFn = lambda_.Function(
            self, "QFDS",
            code=lambda_.InlineCode(mysource_code),
            handler="lambda_handler",
            timeout=Duration.seconds(300),
            runtime=lambda_.Runtime.PYTHON_3_7,
            layers=[lambdaLayer],
        )

        # Set up the event (managed Kafka).
        lambdaFn.add_event_source(ManagedKafkaEventSource(
            cluster_arn=msk_arn,
            topic="foreign.endpoint.availabilty.notify",
            secret=secret,
            batch_size=100,  # default
            starting_position=lambda_.StartingPosition.TRIM_HORIZON
        ))
Looking at the code sample, I understand that you are working with Amazon MSK as an event source, and not just self-managed (cross-account) Kafka.
I assume I will have to set up a Secrets Manager secret with part or all of the values
You don't need to set up new credentials. If you use MSK with SASL/SCRAM, you already have credentials, which must be associated with the MSK cluster.
As you can see from the doc, your secret name should start with AmazonMSK_, for example AmazonMSK_LambdaSecret.
So, in the code above, you will need to fix this line:
secret = Secret(self, "Secret", secret_name="AmazonMSK_LambdaSecret")
I assume you are already aware of the CDK Python doc, but I will just add it here for reference.
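As a side note, if the SASL/SCRAM secret is already created and attached to the cluster, you would typically reference it rather than create a new one. A minimal sketch, assuming CDK v2 and that the existing secret is named AmazonMSK_LambdaSecret (both are assumptions, not from the question):
from aws_cdk.aws_secretsmanager import Secret

# Look up the existing secret by name instead of creating a new one
# (the name is an assumption; use the secret actually associated with your cluster)
secret = Secret.from_secret_name_v2(self, "MskSecret", "AmazonMSK_LambdaSecret")
The returned ISecret can be passed to ManagedKafkaEventSource exactly as in the stack above.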

Dialogflow CX - Location settings have to be initialized - FAILED_PRECONDITION

I am automating Dialogflow CX using the Python client libraries. That includes agent/intent/entity etc. creation/update/deletion.
But on the first run, I am encountering the error below from Python.
If I log in to the console, set the location from there, and rerun the code, it works fine. I am able to create the agent.
Followed this URL of GCP -
https://cloud.google.com/dialogflow/cx/docs/concept/region
I am looking for code to automate the region & location setting before running the python code. Kindly provide me with the code.
Below is the code I am using to create the agent.
Error -
google.api_core.exceptions.FailedPrecondition: 400 com.google.apps.framework.request.FailedPreconditionException: Location settings have to be initialized before creating the agent in location: us-east1. Code: FAILED_PRECONDITION
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "com.google.apps.framework.request.FailedPreconditionException: Location settings have to be initialized before creating the agent in location: us-east1. Code: FAILED_PRECONDITION"
debug_error_string = "{"created":"#1622183899.891000000","description":"Error received from peer ipv4:142.250.195.170:443","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"com.google.apps.framework.request.FailedPreconditionException: Location settings have to be initialized before creating the agent in location: us-east1. Code: FAILED_PRECONDITION","grpc_status":9}"
main.py -
# Import Libraries
import google.auth
import google.auth.transport.requests
from google.cloud import dialogflowcx as df
from google.protobuf.field_mask_pb2 import FieldMask
import os, time
import pandas as pd

# Function - Authentication
def gcp_auth():
    cred, project = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
    auth_req = google.auth.transport.requests.Request()
    cred.refresh(auth_req)

# Function - Create Agent
def create_agent(agent_name, agent_description, language_code, location_id, location_path):
    if location_id == "global":
        agentsClient = df.AgentsClient()
    else:
        agentsClient = df.AgentsClient(client_options={"api_endpoint": f"{location_id}-dialogflow.googleapis.com:443"})
    agent = df.Agent(display_name=agent_name, description=agent_description, default_language_code=language_code, time_zone=time_zone, enable_stackdriver_logging=True)
    createAgentRequest = df.CreateAgentRequest(agent=agent, parent=location_path)
    agent = agentsClient.create_agent(request=createAgentRequest)
    return agent
Currently, Dialogflow does not support configuring the location settings through the API, so you cannot initialise the location settings through it. You can only set the location through the Console.
As an alternative, since the location setting has to be initialised only once for each region per project, you could set the location in the Console and then automate the agent creation process; some useful links: 1 and 2.
On the other hand, if you would find this feature useful, you can file a Feature Request here. It will be evaluated by Google's product team.
Many thanks Alexandre Moraes. I have raised a feature request for the same.

how to change spark.r.backendConnectionTimeout value in RStudio?

I am using RStudio to connect to my HDFS file using SparkR. When I leave Spark analyses running overnight, I get an "R session aborted" error the next day. From Spark's documentation on SparkR (https://spark.apache.org/docs/latest/configuration.html), the default value of spark.r.backendConnectionTimeout is set to 6000s. I would like to change this value to something larger so that my connection doesn't time out after the analysis is done.
I have tried the following:
sparkR.session(master = "local[*]", sparkConfig = list(spark.r.backendConnectionTimeout = 10))
sparkR.session(master = "local[*]", spark.r.backendConnectionTimeout = 10)
I get the same output for both commands:
Spark package found in SPARK_HOME: C:\Spark\spark-2.3.2-bin-hadoop2.7
Launching java with spark-submit command C:\Spark\spark-2.3.2-bin-hadoop2.7/bin/spark-submit2.cmd sparkr-shell C:\Users\XYZ\AppData\Local\Temp\3\RtmpiEaE5q\backend_port696c18316c61
Java ref type org.apache.spark.sql.SparkSession id 1
It seems that the parameter was not passed correctly. Also, I am not sure where to pass that parameter.
Any help would be appreciated.
A similar post is around, but that involves Zeppelin (how to change spark.r.backendConnectionTimeout value?).
Thanks.
I found the solution: modify the spark-defaults.conf file and add the following line:
spark.r.backendConnectionTimeout = 6000000
(or whatever time limit you want)
IMPORTANT note - restart the hadoop and yarn services, then try connecting to Spark with SparkR normally:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local")
You can check whether the settings took effect at http://localhost:4040/environment/
I hope this is useful for other people.

How to use Hadoop Credential provider in Spark to connect to Oracle database?

I am trying to establish a secure connection between Spark and Oracle, as well as between Sqoop and Oracle. After my research I have found two different options for two different setups.
Connecting Spark to Oracle where the password is encrypted using spark.jdbc.b64password and then decrypted in the Spark code and used in the JDBC URL.
Using the Hadoop credential provider to create a password file, which is then used by Sqoop to connect to Oracle.
Now, keeping the password in two different files doesn't seem like good practice. My question is: can we use the Hadoop credential provider in Spark, so that it uses the same credential profile created for Sqoop?
If you have any other option to make this better, please help.
The recommended way is to use Kerberos authentication both in Spark and Hadoop and with Oracle. The Oracle JDBC thin driver supports Kerberos authentication. A single Kerberos principal is then used to authenticate the user all the way from Spark or Hadoop to the Oracle database.
You can use any language supported by Spark to read the JCEKS password from inside your code:
Python:
spark1 = SparkSession.builder.appName("xyz").master("yarn").enableHiveSupport().config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
x = spark1.sparkContext._jsc.hadoopConfiguration()
x.set("hadoop.security.credential.provider.path", "jceks://file///localpathtopassword")
a = x.getPassword("<password alias>")
passw = ""
for i in range(a.__len__()):
    passw = passw + str(a.__getitem__(i))
The code above leaves the password string in passw.
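As a hedged follow-up (not from the original answer): once passw holds the password, it can be used directly in a JDBC read. The URL, table, user, and driver below are placeholders for illustration, and the Oracle JDBC jar is assumed to be on the classpath.
# Placeholder connection details - substitute your own Oracle host, service, table and user
df = (spark1.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1")
      .option("dbtable", "SCHEMA.SOME_TABLE")
      .option("user", "db_user")
      .option("password", passw)
      .option("driver", "oracle.jdbc.OracleDriver")
      .load())
df.show(5)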
Scala:
import org.apache.hadoop.security.alias.CredentialProvider
import org.apache.hadoop.security.alias.CredentialProvider.CredentialEntry
import org.apache.hadoop.security.alias.CredentialProviderFactory
import org.apache.hadoop.conf.Configuration

val conf_H: Configuration = new org.apache.hadoop.conf.Configuration()
val alias = password_alias
val jceksPath = security_credential_provider_path
conf_H.set(CredentialProviderFactory.CREDENTIAL_PROVIDER_PATH, jceksPath)
val pass = conf_H.getPassword(alias).mkString
if (pass != null && !pass.isEmpty() && !pass.equalsIgnoreCase("")) {
  jdbcPassword = pass
}
You can also let Spark set hadoop.security.credential.provider.path in the Hadoop configuration in the following way:
"""
Create java key store with following command:
> keytool -genseckey -alias duke -keypass 123456 -storetype jceks -keystore keystore.jceks
> export HADOOP_CREDSTORE_PASSWORD=123456
"""
jceks = os.path.join(os.path.dirname(__file__), "keystore.jceks")
print(jceks)
assert os.path.isfile(jceks)
spark_session = lambda: (SparkSession
.builder
.enableHiveSupport()
.config('spark.ui.enabled', False)
.config("spark.hadoop.hadoop.security.credential.provider.path",
"jceks://file//" + jceks)
.getOrCreate())
with spark_session() as spark:
hc = spark.sparkContext._jsc.hadoopConfiguration()
jo = hc.getPassword("duke")
expected_password = ''.join(jo)
assert len(retrieved_password) > 0
spark.hadoop.hadoop.security.credential.provider.path looks a little odd, but Spark strips the spark.hadoop. prefix when it loads Hadoop settings.
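A small illustration of that prefix stripping (a sketch with a placeholder keystore path, not from the original answer): any spark.hadoop.* option set on the session shows up, without the prefix, in the Hadoop configuration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.hadoop.security.credential.provider.path",
                 "jceks://file//tmp/keystore.jceks")  # placeholder path
         .getOrCreate())

hc = spark.sparkContext._jsc.hadoopConfiguration()
# Prints the value under the un-prefixed Hadoop key
print(hc.get("hadoop.security.credential.provider.path"))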

Not able to connect to hive on AWS EMR using java

I have set up an AWS EMR cluster with Hive. I want to connect to the Hive thrift server from my local machine using Java. I tried the following code -
Class.forName("com.amazon.hive.jdbc3.HS2Driver");
con = DriverManager.getConnection("jdbc:hive2://ec2XXXX.compute-1.amazonaws.com:10000/default","hadoop", "");
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HiveJDBCDriver.html. As mentioned in the developer guide, I added the jars for the Hive JDBC driver to the classpath.
But I am getting an exception when trying to get the connection.
I was able to connect to the Hive server on a plain Hadoop cluster using the above code (with a different JDBC driver).
Can someone please suggest if I am missing something?
Is it possible to connect to the Hive server on AWS EMR from a local machine using Hive JDBC?
(Merged Answer from the comments)
Hive is running on port 10000, but only locally; you have to create an SSH tunnel to the EMR master node.
The following is from the documentation for hive 0.13.1
Create Tunnel
ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10000:localhost:10000 hadoop@master-public-dns-name
Connect to JDBC
jdbc:hive2://localhost:10000/default
You can set up the tunnel from code using the JSch library:
public static void portForwardForHive() {
    try {
        if (session != null && session.isConnected()) {
            return;
        }
        JSch jsch = new JSch();
        jsch.addIdentity(PATH_TO_SSH_KEY_PEM);
        String host = REMOTE_HOST;
        session = jsch.getSession(USER, host, 22);
        // username and password will be given via UserInfo interface.
        UserInfo ui = new MyUserInfo();
        session.setUserInfo(ui);
        session.connect();
        int assignedPort = session.setPortForwardingL(LPORT, RHOST, RPORT);
        System.out.println("Port forwarding done for the port : " + assignedPort);
    } catch (Exception e) {
        System.out.println(e);
    }
}
Not sure if you've resolved this yet, but it's a bug in EMR that's just bitten me.
For direct JDBC connectivity like you are doing, you must include the JDBC drivers in your shaded uber-jar. For JDBC access from within dataframes, you cannot access the jar from your uber-jar (another unrelated bug), but you must specify it on the command line (S3 is a convenient place to keep them):
--files s3://mybucketJAR/postgresql-9.4-1201.jdbc4.jar
However, even after this you will run into another problem if you are specifically trying to access Hive. Amazon has built its own JDBC drivers with a different class hierarchy from the normal Hive driver (com.amazon.hive.jdbc41.HS2Driver); however, the EMR cluster includes the standard Hive JDBC driver on its standard path (org.apache.hive.jdbc.HiveDriver).
This driver is automatically registered as being capable of handling the jdbc:hive and jdbc:hive2 URLs, so when you try to connect to a Hive URL it finds this one first and uses it - even if you specifically register the Amazon one. Unfortunately, this one is not compatible with Amazon's EMR build of Hive.
There are two possible solutions:
1: Find the offending driver and unregister it:
Scala example:
val jdbcDrv = Collections.list(DriverManager.getDrivers)
for (i <- 0 until jdbcDrv.size) {
  val drv = jdbcDrv.get(i)
  val drvName = drv.getClass.getName
  if (drvName == "org.apache.hive.jdbc.HiveDriver") {
    log.info(s"Deregistering JDBC Driver: ${drvName}")
    DriverManager.deregisterDriver(drv)
  }
}
Or
2: As I found out later, you can specify the driver as part of the connect properties when you attempt to connect:
Scala example:
val hiveCredentials = new java.util.Properties
hiveCredentials.setProperty("user", hiveDBUser)
hiveCredentials.setProperty("password", hiveDBPassword)
hiveCredentials.setProperty("driver", "com.amazon.hive.jdbc41.HS2Driver")
val conn = DriverManager.getConnection(hiveDBURL, hiveCredentials)
This is a more "correct" version as it should override any preregistered handlers even if they have completely different class hierarchies.
