Web interface login Apache Hadoop Cluster with Kerberos - hadoop

I have a Docker stack with an Apache Hadoop (version 3.3.4) cluster, composed of one namenode and two datanodes, plus a container running both the Kerberos admin server and the Kerberos KDC.
I'm trying to configure Kerberos authentication on the Apache Hadoop cluster.
The namenode and the datanodes connect correctly to the Kerberos container and to each other using the Kerberos principals. However, authentication to the namenode web interface doesn't work and I get the following error:
The details of my configuration are the following.
I have four containers in my Docker stack:
The namenode with hostname "namenodehostname.host.com" and alias "namenode";
First datanode with hostname "datanode1hostname.host.com" and alias "datanode1";
Second datanode with hostname "datanode2hostname.host.com" and alias "datanode2";
Kerberos server and kdc with hostname "krb5.host.com" and alias "kerberos".
The namenode and the datanodes start as a custom user "hadoop" created in the Dockerfile.
All four containers have the following /etc/hosts file:
namenode  namenodehostname.host.com
datanode1 datanode1hostname.host.com
datanode2 datanode2hostname.host.com
kerberos krb5.host.com
The file krb5.conf (in the namenode, in the datanodes and in the Kerberos container) is:
[libdefaults]
default_realm = TESTREALM 
# The following krb5.conf variables are only for MIT Kerberos.
kdc_timesync = 1
ccache_type = 4
forwardable = true
proxiable = true

# The following encryption type specification will be used by MIT Kerberos
# if uncommented.  In general, the defaults in the MIT Kerberos code are
# correct and overriding these specifications only serves to disable new
# encryption types as they are added, creating interoperability problems.
#
# The only time when you might need to uncomment these lines and change
# the enctypes is if you have local software that will break on ticket
# caches containing ticket encryption types it doesn't know about (such as
# old versions of Sun Java).
#       default_tgs_enctypes = des3-hmac-sha1
#       default_tkt_enctypes = des3-hmac-sha1
#       permitted_enctypes = des3-hmac-sha1

# The following libdefaults parameters are only for Heimdal Kerberos.
fcc-mit-ticketflags = true
[realms]
TESTREALM = {
kdc = krb5.host.com
admin_server = krb5.host.com
} 
[domain_realm]
.host.com = TESTREALM
host.com = TESTREALM
The file kdc.conf (in the Kerberos container) is:
[kdcdefaults]
    kdc_ports = 750,88 
[realms]
    TESTREALM = {
        database_name = /etc/krb5kdc/data/database/principal
        admin_keytab = FILE:/etc/krb5kdc/data/keytabs/kadm5.keytab
        acl_file = /etc/krb5kdc/kadm5.acl
        key_stash_file = /etc/krb5kdc/data/stashfile/stash
        kdc_ports = 750,88
        max_life = 10h 0m 0s
        max_renewable_life = 7d 0h 0m 0s
        master_key_type = des3-hmac-sha1
        #supported_enctypes = aes256-cts:normal aes128-cts:normal
        default_principal_flags = +preauth
    }
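For context, a minimal sketch of how a KDC using this kdc.conf is typically initialized and started on the Kerberos container (these are common MIT Kerberos steps, not necessarily the exact ones used here):
kdb5_util create -r TESTREALM -s          # create the principal database and the stash file
kadmin.local -q "addprinc root/admin"     # create the admin principal
krb5kdc && kadmind                        # start the KDC and the admin server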
I created five principals in the Kerberos container:
root/admin
nn/namenodehostname.host.com: used by the namenode
HTTP/namenodehostname.host.com: used by the namenode
dn/datanode1hostname.host.com: used by first datanode
dn/datanode2hostname.host.com: used by second datanode
All these principals, except for root/admin, are mapped to the user "hadoop" on the namenode and on the datanodes (see the property hadoop.security.auth_to_local in the core-site.xml file).
I also created a keytab file for each principal ending in *.host.com.
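For reference, a minimal sketch of how such principals and keytabs are typically created and checked with MIT Kerberos; the principal names and keytab paths follow the configuration above, but these are not necessarily the exact commands used:
kadmin.local -q "addprinc -randkey nn/namenodehostname.host.com@TESTREALM"
kadmin.local -q "addprinc -randkey HTTP/namenodehostname.host.com@TESTREALM"
kadmin.local -q "ktadd -k /etc/security/keytab/nn.service.keytab nn/namenodehostname.host.com@TESTREALM"
kadmin.local -q "ktadd -k /etc/security/keytab/spnego.service.keytab HTTP/namenodehostname.host.com@TESTREALM"
klist -kt /etc/security/keytab/spnego.service.keytab          # verify keytab entries and key version numbers
hadoop kerbname HTTP/namenodehostname.host.com@TESTREALM      # shows how auth_to_local maps the principal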
The namenode is configured with the following files:
core-site.xml
<configuration>
<property><name>fs.defaultFS</name><value>hdfs://namenodehostname.host.com:9000</value></property>
<property><name>hadoop.security.authentication</name><value>kerberos</value></property>
<property><name>hadoop.security.authorization</name><value>true</value></property>
<property><name>hadoop.rpc.protection</name><value>authentication</value></property>
<property><name>hadoop.security.auth_to_local</name><value>
RULE:[1:$1](nn/namenodehostname.host.com@TESTREALM)s/^.*$/hadoop/
RULE:[1:$1](dn/datanode1hostname.host.com@TESTREALM)s/^.*$/hadoop/
RULE:[1:$1](dn/datanode2hostname.host.com@TESTREALM)s/^.*$/hadoop/
RULE:[1:$1](http/namenodehostname.host.com@TESTREALM)s/^.*$/hadoop/
DEFAULT</value></property>
<property><name>hadoop.http.filter.initializers</name><value>org.apache.hadoop.security.AuthenticationFilterInitializer</value></property>
<property><name>hadoop.http.authentication.token.validity</name><value>3600</value></property>
<property><name>hadoop.http.authentication.cookie.domain</name><value>host.com</value></property>
<property><name>hadoop.http.authentication.cookie.persistent</name><value>false</value></property>
<property><name>hadoop.http.authentication.simple.anonymous.allowed</name><value>false</value></property>
<property><name>hadoop.http.authentication.kerberos.principal</name><value>http/namenodehostname.host.com@TESTREALM</value></property>
<property><name>hadoop.http.authentication.kerberos.keytab</name><value>/etc/security/keytab/spnego.service.keytab</value></property>
</configuration>
hdfs-site.xml
<configuration>
<property><name>dfs.namenode.name.dir</name><value>file:///home/hadoop/hadoopdata/hdfs/namenode</value></property>
<property><name>dfs.namenode.edits.dir</name><value>file:///home/hadoop/hadooplogs/hdfs/edits</value></property>
<property><name>dfs.replication</name><value>1</value></property>
<property><name>dfs.datanode.http.address</name><value>0.0.0.0:8108</value></property>
<property><name>dfs.datanode.https.address</name><value>0.0.0.0:8109</value></property>
<property><name>dfs.webhdfs.enabled</name><value>true</value></property>
<property><name>dfs.client.use.datanode.hostname</name><value>true</value></property>
<property><name>dfs.datanode.use.datanode.hostname</name><value>true</value></property>
<property><name>dfs.namenode.acls.enabled</name><value>false</value></property>
<property><name>dfs.namenode.posix.acl.inheritance.enabled</name><value>true</value></property>
<property><name>dfs.permissions.enabled</name><value>false</value></property>
<property><name>dfs.namenode.datanode.registration.ip-hostname-check</name><value>false</value></property>
<property><name>dfs.namenode.rpc-bind-host</name><value>0.0.0.0</value></property>
<property><name>dfs.namenode.servicerpc-bind-host</name><value>0.0.0.0</value></property>
<property><name>dfs.namenode.http-bind-host</name><value>0.0.0.0</value></property>
<property><name>dfs.namenode.https-bind-host</name><value>0.0.0.0</value></property>
<property><name>dfs.block.access.token.enable</name><value>true</value></property>
<property><name>dfs.namenode.keytab.file</name><value>/etc/security/keytab/nn.service.keytab</value></property>
<property><name>dfs.namenode.kerberos.principal</name><value>nn/namenodehostname.host.com@TESTREALM</value></property>
<property><name>dfs.namenode.kerberos.internal.spnego.principal</name><value>HTTP/namenodehostname.host.com@TESTREALM</value></property>
<property><name>dfs.web.authentication.kerberos.keytab</name><value>/etc/security/keytab/spnego.service.keytab</value></property>
<property><name>dfs.web.authentication.kerberos.principal</name><value>HTTP/namenodehostname.host.com@TESTREALM</value></property>
<property><name>dfs.http.policy</name><value>HTTPS_ONLY</value></property>
</configuration>
The first datanode is configured with the following files:
core-site.xml
<configuration>
<property><name>fs.defaultFS</name><value>hdfs://namenodehostname.host.com:9000</value></property>
<property><name>hadoop.security.authentication</name><value>kerberos</value></property>
<property><name>hadoop.security.authorization</name><value>true</value></property>
<property><name>hadoop.rpc.protection</name><value>authentication</value></property>
<property><name>hadoop.security.auth_to_local</name><value>
RULE:[1:$1](nn/namenodehostname.host.com@TESTREALM)s/^.*$/hadoop/
RULE:[1:$1](dn/datanode1hostname.host.com@TESTREALM)s/^.*$/hadoop/
RULE:[1:$1](dn/datanode2hostname.host.com@TESTREALM)s/^.*$/hadoop/
RULE:[1:$1](http/namenodehostname.host.com@TESTREALM)s/^.*$/hadoop/
DEFAULT</value></property>
<property><name>hadoop.http.filter.initializers</name><value>org.apache.hadoop.security.AuthenticationFilterInitializer</value></property>
<property><name>hadoop.http.authentication.token.validity</name><value>3600</value></property>
<property><name>hadoop.http.authentication.cookie.domain</name><value>host.com</value></property>
<property><name>hadoop.http.authentication.cookie.persistent</name><value>false</value></property>
<property><name>hadoop.http.authentication.simple.anonymous.allowed</name><value>false</value></property>
<property><name>hadoop.http.authentication.kerberos.principal</name><value>http/namenodehostname.host.com@TESTREALM</value></property>
<property><name>hadoop.http.authentication.kerberos.keytab</name><value>/etc/security/keytab/spnego.service.keytab</value></property>
</configuration>
The second datanode has the same core-site.xml file as the first datanode and the following hdfs-site.xml file:
<configuration>
<property><name>dfs.datanode.data.dir</name><value>file:///home/hadoop/hadoopdata/hdfs/datanode</value></property>
<property><name>dfs.datanode.failed.volumes.tolerated</name><value>0</value></property>
<property><name>dfs.datanode.address</name><value>0.0.0.0:8100</value></property>
<property><name>dfs.datanode.http.address</name><value>0.0.0.0:8108</value></property>
<property><name>dfs.webhdfs.enabled</name><value>true</value></property>
<property><name>dfs.client.use.datanode.hostname</name><value>true</value></property>
<property><name>dfs.datanode.use.datanode.hostname</name><value>true</value></property>
<property><name>dfs.permissions.enabled</name><value>false</value></property>
<property><name>dfs.namenode.datanode.registration.ip-hostname-check</name><value>false</value></property>
<property><name>dfs.namenode.rpc-bind-host</name><value>0.0.0.0</value></property>
<property><name>dfs.namenode.servicerpc-bind-host</name><value>0.0.0.0</value></property>
<property><name>dfs.namenode.http-bind-host</name><value>0.0.0.0</value></property>
<property><name>dfs.namenode.https-bind-host</name><value>0.0.0.0</value></property>
<property><name>dfs.datanode.hostname</name><value>datanode2hostname.host.com</value></property>
<property><name>dfs.block.access.token.enable</name><value>true</value></property>
<property><name>dfs.datanode.data.dir.perm</name><value>700</value></property>
<property><name>dfs.datanode.keytab.file</name><value>/etc/security/keytab/dn.service.keytab</value></property>
<property><name>dfs.datanode.kerberos.principal</name><value>dn/datanode2hostname.host.com@TESTREALM</value></property>
<property><name>dfs.encrypt.data.transfer</name><value>false</value></property>
<property><name>dfs.datanode.https.address</name><value>0.0.0.0:8109</value></property>
<property><name>dfs.data.transfer.protection</name><value>authentication</value></property>
<property><name>dfs.http.policy</name><value>HTTPS_ONLY</value></property>
</configuration>
The namenode and the datanodes have the following ssl-server.xml file:
<configuration>
<property><name>ssl.server.keystore.location</name><value>/home/hadoop/keystore.jks</value></property>
<property><name>ssl.server.keystore.password</name><value>password123.</value></property>
<property><name>ssl.server.keystore.type</name><value>JKS</value></property>
<property><name>ssl.server.truststore.location</name><value>/home/hadoop/truststore.jks</value></property>
<property><name>ssl.server.truststore.password</name><value>password123.</value></property>
<property><name>ssl.server.truststore.type</name><value>JKS</value></property>
</configuration>
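For reference, a minimal sketch of how a keystore and truststore like these are often generated for a test cluster with keytool; the alias, DN and validity below are illustrative assumptions, not the actual values used:
keytool -genkeypair -alias namenode -keyalg RSA -keysize 2048 -validity 365 -keystore /home/hadoop/keystore.jks -storepass password123. -dname "CN=namenodehostname.host.com"
keytool -exportcert -alias namenode -keystore /home/hadoop/keystore.jks -storepass password123. -file namenode.crt
keytool -importcert -alias namenode -file namenode.crt -keystore /home/hadoop/truststore.jks -storepass password123. -noprompt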
Is there anything else I need to do to be able to log in to the namenode web interface?
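For what it's worth, a quick way to test SPNEGO against the namenode web UI from inside one of the containers is sketched below, assuming the default HTTPS port 9871 and a curl build with GSS-API/SPNEGO support; the principal used for kinit is just an example:
kinit root/admin    # or kinit -kt <keytab> <principal> with any user principal you have
curl --negotiate -u : -k -v "https://namenodehostname.host.com:9871/"
curl --negotiate -u : -k "https://namenodehostname.host.com:9871/webhdfs/v1/?op=LISTSTATUS"    # WebHDFS over the same SPNEGO handshake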

Related

'atlas.graph.index.search.max-result-set-size' doesn't map to a List object: 150

I tried to set up the Sqoop hook with Atlas following these steps:
1- Set up the Atlas hook in sqoop-site.xml:
<property>
<name>sqoop.job.data.publish.class</name>
<value>org.apache.atlas.sqoop.hook.SqoopHook</value>
</property>
2- Copy the contents of the folder apache-atlas-sqoop-hook to hook/sqoop
3- Copy atlas-application.properties to sqoop-conf-dir
4- Copy all the jars in /hook/sqoop to sqoop-dir/lib
But when I tried to execute the sqoop import command:
sqoop import --connect jdbc:postgresql://server/db --username user -P --table tab --hive-import --create-hive-table
I got the following error:
22/08/01 16:44:51 INFO atlas.ApplicationProperties: Setting atlas.graph.index.search.max-result-set-size = 150
22/08/01 16:44:51 INFO atlas.ApplicationProperties: Setting atlas.graph.index.search.solr.wait-searcher = false
22/08/01 16:44:51 INFO atlas.ApplicationProperties: Property (set to default) atlas.graph.cache.db-cache = true
22/08/01 16:44:51 INFO atlas.ApplicationProperties: Property (set to default) atlas.graph.cache.db-cache-clean-wait = 20
22/08/01 16:44:51 INFO atlas.ApplicationProperties: Property (set to default) atlas.graph.cache.db-cache-size = 0.5
22/08/01 16:44:51 INFO atlas.ApplicationProperties: Property (set to default) atlas.graph.cache.tx-cache-size = 15000
22/08/01 16:44:51 INFO atlas.ApplicationProperties: Property (set to default) atlas.graph.cache.tx-dirty-size = 120
22/08/01 16:44:51 INFO hook.AtlasHook: Failed to load application properties
org.apache.atlas.AtlasException: Failed to load application properties
at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:155)
at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:108)
at org.apache.atlas.hook.AtlasHook.<clinit>(AtlasHook.java:82)
at org.apache.atlas.sqoop.hook.SqoopHook.<clinit>(SqoopHook.java:86)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.sqoop.mapreduce.PublishJobData.publishJobData(PublishJobData.java:46)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:284)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:127)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:520)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: org.apache.commons.configuration.ConversionException: 'atlas.graph.index.search.max-result-set-size' doesn't map to a List object: 150, a java.lang.Integer
at org.apache.commons.configuration.AbstractConfiguration.getList(AbstractConfiguration.java:1144)
at org.apache.commons.configuration.AbstractConfiguration.getList(AbstractConfiguration.java:1109)
at org.apache.commons.configuration.AbstractConfiguration.interpolatedConfiguration(AbstractConfiguration.java:1274)
at org.apache.atlas.ApplicationProperties.get(ApplicationProperties.java:150)
... 17 more
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.atlas.sqoop.hook.SqoopHook.<clinit>(SqoopHook.java:86)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.sqoop.mapreduce.PublishJobData.publishJobData(PublishJobData.java:46)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:284)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:692)
at org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:127)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:520)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: java.lang.NullPointerException
at org.apache.atlas.hook.AtlasHook.<clinit>(AtlasHook.java:87)
... 15 more
Here is the atlas-application.properties file:
######### Graph Database Configs #########
# Graph Database
#Configures the graph database to use. Defaults to JanusGraph
#atlas.graphdb.backend=org.apache.atlas.repository.graphdb.janus.AtlasJanusGraphDatabase
# Graph Storage
# Set atlas.graph.storage.backend to the correct value for your desired storage
# backend. Possible values:
#
# hbase
# cassandra
# embeddedcassandra - Should only be set by building Atlas with -Pdist,embedded-cassandra-solr
# berkeleyje
#
# See the configuration documentation for more information about configuring the various storage backends.
#
atlas.graph.storage.backend=hbase2
atlas.graph.storage.hbase.table=apache_atlas_janus
# atlas.graph.storage.username=
# atlas.graph.storage.password=
#atlas.graph.cache.db-cache=false
atlas.graph.cache.db-cache = true
atlas.graph.cache.db-cache-clean-wait = 20
atlas.graph.cache.db-cache-size = 0.5
atlas.graph.cache.tx-cache-size = 15000
atlas.graph.cache.tx-dirty-size = 120
#Hbase
#For standalone mode , specify localhost
#for distributed mode, specify zookeeper quorum here
atlas.graph.storage.hostname=localhost
atlas.graph.storage.hbase.regions-per-server=1
# Gremlin Query Optimizer
#
# Enables rewriting gremlin queries to maximize performance. This flag is provided as
# a possible way to work around any defects that are found in the optimizer until they
# are resolved.
#atlas.query.gremlinOptimizerEnabled=true
# Delete handler
#
# This allows the default behavior of doing "soft" deletes to be changed.
#
# Allowed Values:
# org.apache.atlas.repository.store.graph.v1.SoftDeleteHandlerV1 - all deletes are "soft" deletes
# org.apache.atlas.repository.store.graph.v1.HardDeleteHandlerV1 - all deletes are "hard" deletes
#
#atlas.DeleteHandlerV1.impl=org.apache.atlas.repository.store.graph.v1.SoftDeleteHandlerV1
# Entity audit repository
#
# This allows the default behavior of logging entity changes to hbase to be changed.
#
# Allowed Values:
# org.apache.atlas.repository.audit.HBaseBasedAuditRepository - log entity changes to hbase
# org.apache.atlas.repository.audit.CassandraBasedAuditRepository - log entity changes to cassandra
# org.apache.atlas.repository.audit.NoopEntityAuditRepository - disable the audit repository
#
atlas.EntityAuditRepository.impl=org.apache.atlas.repository.audit.HBaseBasedAuditRepository
# if Cassandra is used as a backend for audit from the above property, uncomment and set the following
# properties appropriately. If using the embedded cassandra profile, these properties can remain
# commented out.
# atlas.EntityAuditRepository.keyspace=atlas_audit
# atlas.EntityAuditRepository.replicationFactor=1
# Graph Search Index
atlas.graph.index.search.backend=solr5
#Solr
#Solr cloud mode properties
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.zookeeper-url=localhost:2181
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
atlas.graph.index.search.solr.zookeeper-session-timeout=60000
atlas.graph.index.search.solr.wait-searcher=false
#Solr http mode properties
#atlas.graph.index.search.solr.mode=http
#atlas.graph.index.search.solr.http-urls=http://localhost:8983/solr
# Solr-specific configuration property
atlas.graph.index.search.max-result-set-size=150
######### Import Configs #########
#atlas.import.temp.directory=/temp/import
######### Notification Configs #########
atlas.notification.embedded=false
atlas.kafka.data=${sys:atlas.home}/data/kafka
atlas.kafka.zookeeper.connect=localhost:9026
atlas.kafka.bootstrap.servers=localhost:9027
atlas.kafka.zookeeper.session.timeout.ms=400
atlas.kafka.zookeeper.connection.timeout.ms=200
atlas.kafka.zookeeper.sync.time.ms=20
atlas.kafka.auto.commit.interval.ms=1000
atlas.kafka.hook.group.id=atlas
atlas.kafka.enable.auto.commit=false
atlas.kafka.auto.offset.reset=earliest
atlas.kafka.session.timeout.ms=30000
atlas.kafka.offsets.topic.replication.factor=1
atlas.kafka.poll.timeout.ms=1000
atlas.notification.create.topics=true
atlas.notification.replicas=1
atlas.notification.topics=ATLAS_HOOK,ATLAS_ENTITIES
atlas.notification.log.failed.messages=true
atlas.notification.consumer.retry.interval=500
atlas.notification.hook.retry.interval=1000
# Enable for Kerberized Kafka clusters
#atlas.notification.kafka.service.principal=kafka/_HOST@EXAMPLE.COM
#atlas.notification.kafka.keytab.location=/etc/security/keytabs/kafka.service.keytab
## Server port configuration
#atlas.server.http.port=21000
#atlas.server.https.port=21443
######### Security Properties #########
# SSL config
atlas.enableTLS=false
#truststore.file=/path/to/truststore.jks
#cert.stores.credential.provider.path=jceks://file/path/to/credentialstore.jceks
#following only required for 2-way SSL
#keystore.file=/path/to/keystore.jks
# Authentication config
atlas.authentication.method.kerberos=false
atlas.authentication.method.file=true
#### ldap.type= LDAP or AD
atlas.authentication.method.ldap.type=none
#### user credentials file
atlas.authentication.method.file.filename=${sys:atlas.home}/conf/users-credentials.properties
### groups from UGI
#atlas.authentication.method.ldap.ugi-groups=true
######## LDAP properties #########
#atlas.authentication.method.ldap.url=ldap://<ldap server url>:389
#atlas.authentication.method.ldap.userDNpattern=uid={0},ou=People,dc=example,dc=com
#atlas.authentication.method.ldap.groupSearchBase=dc=example,dc=com
#atlas.authentication.method.ldap.groupSearchFilter=(member=uid={0},ou=Users,dc=example,dc=com)
#atlas.authentication.method.ldap.groupRoleAttribute=cn
#atlas.authentication.method.ldap.base.dn=dc=example,dc=com
#atlas.authentication.method.ldap.bind.dn=cn=Manager,dc=example,dc=com
#atlas.authentication.method.ldap.bind.password=<password>
#atlas.authentication.method.ldap.referral=ignore
#atlas.authentication.method.ldap.user.searchfilter=(uid={0})
#atlas.authentication.method.ldap.default.role=<default role>
######### Active directory properties #######
#atlas.authentication.method.ldap.ad.domain=example.com
#atlas.authentication.method.ldap.ad.url=ldap://<AD server url>:389
#atlas.authentication.method.ldap.ad.base.dn=(sAMAccountName={0})
#atlas.authentication.method.ldap.ad.bind.dn=CN=team,CN=Users,DC=example,DC=com
#atlas.authentication.method.ldap.ad.bind.password=<password>
#atlas.authentication.method.ldap.ad.referral=ignore
#atlas.authentication.method.ldap.ad.user.searchfilter=(sAMAccountName={0})
#atlas.authentication.method.ldap.ad.default.role=<default role>
######### JAAS Configuration ########
#atlas.jaas.KafkaClient.loginModuleName = com.sun.security.auth.module.Krb5LoginModule
#atlas.jaas.KafkaClient.loginModuleControlFlag = required
#atlas.jaas.KafkaClient.option.useKeyTab = true
#atlas.jaas.KafkaClient.option.storeKey = true
#atlas.jaas.KafkaClient.option.serviceName = kafka
#atlas.jaas.KafkaClient.option.keyTab = /etc/security/keytabs/atlas.service.keytab
#atlas.jaas.KafkaClient.option.principal = atlas/_HOST@EXAMPLE.COM
######### Server Properties #########
atlas.rest.address=http://localhost:21000
# If enabled and set to true, this will run setup steps when the server starts
#atlas.server.run.setup.on.start=false
######### Entity Audit Configs #########
atlas.audit.hbase.tablename=apache_atlas_entity_audit
atlas.audit.zookeeper.session.timeout.ms=1000
atlas.audit.hbase.zookeeper.quorum=localhost:2181
######### High Availability Configuration ########
atlas.server.ha.enabled=false
#### Enabled the configs below as per need if HA is enabled #####
#atlas.server.ids=id1
#atlas.server.address.id1=localhost:21000
#atlas.server.ha.zookeeper.connect=localhost:2181
#atlas.server.ha.zookeeper.retry.sleeptime.ms=1000
#atlas.server.ha.zookeeper.num.retries=3
#atlas.server.ha.zookeeper.session.timeout.ms=20000
## if ACLs need to be set on the created nodes, uncomment these lines and set the values ##
#atlas.server.ha.zookeeper.acl=<scheme>:<id>
#atlas.server.ha.zookeeper.auth=<scheme>:<authinfo>
######### Atlas Authorization #########
atlas.authorizer.impl=simple
atlas.authorizer.simple.authz.policy.file=atlas-simple-authz-policy.json
######### Type Cache Implementation ########
# A type cache class which implements
# org.apache.atlas.typesystem.types.cache.TypeCache.
# The default implementation is org.apache.atlas.typesystem.types.cache.DefaultTypeCache which is a local in-memory type cache.
#atlas.TypeCache.impl=
######### Performance Configs #########
#atlas.graph.storage.lock.retries=10
#atlas.graph.storage.cache.db-cache-time=120000
######### CSRF Configs #########
atlas.rest-csrf.enabled=true
atlas.rest-csrf.browser-useragents-regex=^Mozilla.*,^Opera.*,^Chrome.*
atlas.rest-csrf.methods-to-ignore=GET,OPTIONS,HEAD,TRACE
atlas.rest-csrf.custom-header=X-XSRF-HEADER
############ KNOX Configs ################
#atlas.sso.knox.browser.useragent=Mozilla,Chrome,Opera
#atlas.sso.knox.enabled=true
#atlas.sso.knox.providerurl=https://<knox gateway ip>:8443/gateway/knoxsso/api/v1/websso
#atlas.sso.knox.publicKey=
############ Atlas Metric/Stats configs ################
# Format: atlas.metric.query.<key>.<name>
atlas.metric.query.cache.ttlInSecs=900
#atlas.metric.query.general.typeCount=
#atlas.metric.query.general.typeUnusedCount=
#atlas.metric.query.general.entityCount=
#atlas.metric.query.general.tagCount=
#atlas.metric.query.general.entityDeleted=
#
#atlas.metric.query.entity.typeEntities=
#atlas.metric.query.entity.entityTagged=
#
#atlas.metric.query.tags.entityTags=
######### Compiled Query Cache Configuration #########
# The size of the compiled query cache. Older queries will be evicted from the cache
# when we reach the capacity.
#atlas.CompiledQueryCache.capacity=1000
# Allows notifications when items are evicted from the compiled query
# cache because it has become full. A warning will be issued when
# the specified number of evictions have occurred. If the eviction
# warning threshold <= 0, no eviction warnings will be issued.
#atlas.CompiledQueryCache.evictionWarningThrottle=0
######### Full Text Search Configuration #########
#Set to false to disable full text search.
#atlas.search.fulltext.enable=true
######### Gremlin Search Configuration #########
#Set to false to disable gremlin search.
atlas.search.gremlin.enable=false
########## Add http headers ###########
#atlas.headers.Access-Control-Allow-Origin=*
#atlas.headers.Access-Control-Allow-Methods=GET,OPTIONS,HEAD,PUT,POST
#atlas.headers.<headerName>=<headerValue>
######### UI Configuration ########
atlas.ui.default.version=v1
atlas.hook.sqoop.synchronous=false
atlas.hook.sqoop.numRetries=3
atlas.hook.sqoop.queueSize=10000
Any solutions, please?

YARN Schedulers - Fair Scheduler - Running Jobs specifying Queue

How do we assign jobs to a specific queue when we have multiple queues? I'm using YARN on Hadoop with AWS EMR.
On AWS EMR you can create a cluster with Spark installed and set spark.scheduler.mode using the following command, which references a file, myConfig.json, stored in Amazon S3.
aws emr create-cluster --release-label emr-5.36.0 --applications Name=Spark \
--instance-type m5.xlarge --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json
myConfig.json:
[
{
"Classification": "spark-defaults",
"Properties": {
"spark.scheduler.mode": "FAIR"
}
}
]
Or you can specify which scheduler to use when initializing the Spark session, with the following parameters:
val sparkConf = new SparkConf()
sparkConf.set("spark.scheduler.mode", "FAIR")
...
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
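On the original question about targeting a specific queue: with YARN the queue can usually be selected per job via the standard spark-submit option or the spark.yarn.queue property. A minimal sketch, assuming a Fair Scheduler queue named myqueue and an application script my_job.py (both names are illustrative):
spark-submit --master yarn --deploy-mode cluster --queue myqueue my_job.py
spark-submit --master yarn --conf spark.yarn.queue=myqueue my_job.py    # equivalent via configuration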

How to load Impala table directly to Spark using JDBC?

I am trying to write a Spark job in Python that opens a JDBC connection to Impala and loads a VIEW directly from Impala into a DataFrame. This question is pretty close but in Scala: Calling JDBC to impala/hive from within a spark job and creating a table
How do I do this? There are plenty of examples for other data sources such as MySQL, PostgreSQL, etc., but I haven't seen one for Impala + Python + Kerberos. An example would be of great help. Thank you!
I tried this with information from the web but it didn't work.
SPARK Notebook
#!/bin/bash
export PYSPARK_PYTHON=/home/anave/anaconda2/bin/python
export HADOOP_CONF_DIR=/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/home/anave/anaconda2/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=* --no-browser'
# use Java8
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
# JDBC Drivers for Impala
export CLASSPATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30/*.jar:$CLASSPATH
export JDBC_PATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30
# --jars $SRCDIR/spark-csv-assembly-1.4.0-SNAPSHOT.jar \
# --conf spark.sql.parquet.binaryAsString=true \
# --conf spark.sql.hive.convertMetastoreParquet=false
# --num-executors 10 \
pyspark --master yarn-client \
--driver-memory 4G \
--executor-memory 2G \
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar \
--driver-class-path $JDBC_PATH/*.jar
Python Code
properties = {
"driver": "com.cloudera.impala.jdbc41.Driver",
"AuthMech": "1",
# "KrbRealm": "EXAMPLE.COM",
# "KrbHostFQDN": "impala.example.com",
"KrbServiceName": "impala"
}
# imp_env is the hostname of the db, works with other impala queries ran inside python
url = "jdbc:impala:imp_env;auth=noSasl"
db_df = sqlContext.read.jdbc(url=url, table='summary', properties=properties)
I received this error msg (Full Error Log):
Py4JJavaError: An error occurred while calling o42.jdbc.
: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver
You can use
--jars $(echo /dir/of/jars/*.jar | tr ' ' ',')
instead of
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
or, for another approach, please see my answer
The 1st approach is to use spark-submit on the impala_jdbc_connection.py script below, like:
spark-submit --driver-class-path /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --jars /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --class com.cloudera.impala.jdbc41.Driver impala_jdbc_connection.py
impala_jdbc_connection.py
properties = {
"drivers": "com.cloudera.impala.jdbc41.Driver"
}
# initialize the Spark session
spark = (
SparkSession.builder
.config("spark.jars.packages", "jar-packages-list")
.config("spark.sql.warehouse.dir","hdfs://dwh-hdp-node01.dev.ergo.liferunoffinsuranceplatform.com:8020/user/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
)
db_df = spark.read.jdbc(url= 'jdbc:impala://host_ip_address:21050/database_name', table ='table_name', properties = properties)
db_df.show()
The 2nd approach is not a direct import from Impala to Spark but rather a conversion of the results to a Spark dataframe:
pip install impyla (source: https://github.com/cloudera/impyla)
Connect to Impala, fetch the results from the Impala database, and convert the result to a Spark dataframe:
from impala.dbapi import connect
conn = connect(host = 'IP_ADDRESS_OF_HOST', port=21050)
cursor = conn.cursor()
cursor.execute('select * from database.table')
res= cursor.fetchall() # convert res to spark dataframe
for data in res:
print(data)
I did this in an Azure Databricks notebook after setting up the jar in the cluster libraries. I generally followed the previous post, except that the "d" is upper case in the "Driver" config key. Worked great.
properties = {
"Driver": "com.cloudera.impala.jdbc41.Driver"
}
db_df = spark.read.jdbc(url= 'jdbc:impala://hostname.domain.net:21050/dbname;AuthMech=3;UID=xxxx;PWD=xxxx', table ='product', properties = properties)
db_df.show()
This works for me:
spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar
val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"
val connectionProperties = new java.util.Properties()
val hbaseDF = sqlContext.read.jdbc(jdbcURL, "impala_table", connectionProperties)

MapR installation failing for single node cluster

I was following the quick installation guide for a single-node cluster. For this I used a 20GB storage file for MapR-FS, but during installation it gives 'Unable to find disks: /maprfs/storagefile'.
Here is my configuration file.
# Each Node section can specify nodes in the following format
# Hostname: disk1, disk2, disk3
# Specifying disks is optional. If not provided, the installer will use the values of 'disks' from the Defaults section
[Control_Nodes]
maprlocal.td.td.com: /maprfs/storagefile
#control-node2.mydomain: /dev/disk3, /dev/disk9
#control-node3.mydomain: /dev/sdb, /dev/sdc, /dev/sdd
[Data_Nodes]
#data-node1.mydomain
#data-node2.mydomain: /dev/sdb, /dev/sdc, /dev/sdd
#data-node3.mydomain: /dev/sdd
#data-node4.mydomain: /dev/sdb, /dev/sdd
[Client_Nodes]
#client1.mydomain
#client2.mydomain
#client3.mydomain
[Options]
MapReduce1 = true
YARN = true
HBase = true
MapR-DB = true
ControlNodesAsDataNodes = true
WirelevelSecurity = false
LocalRepo = false
[Defaults]
ClusterName = my.cluster.com
User = mapr
Group = mapr
Password = mapr
UID = 2000
GID = 2000
Disks = /maprfs/storagefile
StripeWidth = 3
ForceFormat = false
CoreRepoURL = http://package.mapr.com/releases
EcoRepoURL = http://package.mapr.com/releases/ecosystem-4.x
Version = 4.0.2
MetricsDBHost =
MetricsDBUser =
MetricsDBPassword =
MetricsDBSchema =
Below is the error that I am getting.
2015-04-16 08:18:03,659 callbacks 42 [INFO]: Running task: [Verify Pre-Requisites]
2015-04-16 08:18:03,661 callbacks 87 [ERROR]: maprlocal.td.td.com: Unable to find disks: /maprfs/storagefile from /maprfs/storagefile remove disks: /dev/sda,/dev/sda1,/dev/sda2,/dev/sda3 and retry
2015-04-16 08:18:03,662 callbacks 91 [ERROR]: failed: [maprlocal.td.td.com] => {"failed": true}
2015-04-16 08:18:03,667 installrunner 199 [ERROR]: Host: maprlocal.td.td.com has 1 failures
2015-04-16 08:18:03,668 common 203 [ERROR]: Control Nodes have failures. Please fix the failures and re-run the installation. For more information refer to the installer log at /opt/mapr-installer/var/mapr-installer.log
Please help me here.
Thanks
Shashi
The error is resolved by adding the --skip-checks option to the install command:
/opt/mapr-installer/bin/install --skip-checks new

gogo: CommandNotFoundException: Command not found: services

I know some of the commands changed names when Apache Felix started using Gogo.
For example: ps --> lb (list bundles)
What is the equivalent of services <BUNDLENO>?
I am trying to get the following output from my console:
services 5
Distributed OSGi Zookeeper-Based Discovery Single-Bundle Distribution (6) provides:
-----------------------------------------------------------------------------------
... other services ...
----
objectClass = org.osgi.service.cm.ManagedService
felix.fileinstall.filename = org.apache.cxf.dosgi.discovery.zookeeper.cfg
service.id = 38
service.pid = org.apache.cxf.dosgi.discovery.zookeeper
zookeeper.host = localhost
zookeeper.port = 2181
zookeeper.timeout = 3000
inspect capability service 5
check more details here
help inspect
