WSO2 BAM Cluster - Hadoop

I'm configuring a WSO2 BAM cluster with an external Cassandra and Hadoop cluster, following the indications in the CLUSTER420 documentation for WSO2 BAM 2.5.0. In section 14 I found this:
Change the following properties in the /repository/conf/advanced/hive-site.xml file.
Change the below properties if you are using the incremental data processing and notification task features.
<property>
<name>hive.incremental.processing.intermediate.results.cassandra.hosts</name>
<value>localhost:9160</value>
</property>
<property>
<name>hive.incremental.processing.intermediate.results.cassandra.userName</name>
<value>admin</value>
</property>
<property>
<name>hive.incremental.processing.intermediate.results.cassandra.password</name>
<value>admin</value>
</property>
<!-- Credentials for WSO2BAM_UTILS_KS -->
<property>
<name>notification.task.receiver.username</name>
<value>admin</value>
</property>
<property>
<name>notification.task.receiver.password</name>
<value>admin</value>
</property>
My questions:
In my Cassandra cluster I have 3 nodes, so what do I need to put in the "hive.incremental.processing.intermediate.results.cassandra.hosts" property value?
For "cassandra.userName" and "cassandra.password", do I put admin/admin or my Cassandra cluster credentials?
For the WSO2BAM_UTILS_KS credentials, do I put admin/admin or something else?
Thanks.

List all hostnames as a comma-separated list.
Provide your Cassandra cluster credentials.
It is not necessary to put credentials for WSO2BAM_UTILS_KS. Rather, set all credentials correctly in $BAM_HOME/repository/conf/datasources/master-datasources.xml.
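For example, with a three-node cluster the hosts property should list all three nodes (the hostnames below are placeholders):
<property>
<name>hive.incremental.processing.intermediate.results.cassandra.hosts</name>
<value>cassandra-node1:9160,cassandra-node2:9160,cassandra-node3:9160</value>
</property>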

Related

Set up Hadoop KMS

I tried to set up the Hadoop KMS server and client.
Below is my kms-site.xml:
<property>
<name>hadoop.kms.key.provider.uri</name>
<value>jceks://file@/${user.home}/kms.keystore</value>
<description>
URI of the backing KeyProvider for the KMS.
</description>
</property>
<property>
<name>hadoop.security.keystore.java-keystore-provider.password-file</name>
<value>kms.keystore.password</value>
<description>
If using the JavaKeyStoreProvider, the file name for the keystore password.
</description>
</property>
In core-site.xml I added the following:
<property>
<name>dfs.encryption.key.provider.uri</name>
<value>kms://http@mydomain:16000/kms</value>
</property>
In hdfs-site.xml I added the following:
<property>
<name>dfs.encryption.key.provider.uri</name>
<value>kms://http@mydomain:16000/kms</value>
</property>
Then I restarted Hadoop and used ./kms.sh start to start the KMS.
But when I try to generate a key using the command below:
hadoop key create key_demo -size 256
I get the message below. Am I missing anything?
There are no valid (non-transient) providers configured.
No action has been taken. Use the -provider option to specify
a provider. If you want to use a transient provider then you
MUST use the -provider argument.
I am using Hadoop 3.3.1.
This is my kms-site.xml:
<property>
<name>hadoop.kms.key.provider.uri</name>
<value>jceks://file@/${user.home}/kms.keystore</value>
</property>
<property>
<name>hadoop.security.keystore.java-keystore-provider.password-file</name>
<value>kms.keystore.password</value>
</property>
This is my core-site.xml:
<property>
<name>hadoop.security.key.provider.path</name>
<value>kms://http@localhost:9600/kms</value>
<description>
The KeyProvider to use when interacting with encryption keys used
when reading and writing to an encryption zone.
</description>
</property>
Before adding those to my core-site.xml, I got the same message as yours. I think you are using Hadoop v2, since your port number for the key provider is still 16000; I use v3. I also see that you are still using the JavaKeyStoreProvider like the example in the Hadoop documentation (so am I). If you don't provide the "password file", which is kms.keystore.password, the KMS will terminate immediately after starting up. So you need to place an empty file with that name on your classpath, which is in $HADOOP_HOME/etc/hadoop/.
I know I arrive quite late; I hope it helps.
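To make that concrete, a minimal sequence could look like this (a sketch assuming the default Hadoop 3 layout, where the KMS listens on port 9600):
# Create the (empty) keystore password file the JavaKeyStoreProvider expects;
# the name must match hadoop.security.keystore.java-keystore-provider.password-file.
touch $HADOOP_HOME/etc/hadoop/kms.keystore.password
# In Hadoop 3 the KMS is started with the hadoop command rather than kms.sh.
hadoop --daemon start kms
# Key creation should now find the configured (non-transient) provider.
hadoop key create key_demo -size 256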

Access YARN logs via Web UI

How can I access YARN job logs via the web UI?
I can view job logs via the YARN manager web UI, but every time YARN restarts, the application list of the YARN manager is empty (it was populated before the restart).
I can access application logs via the CLI, even after I restart YARN:
$HADOOP_HOME/bin/yarn logs -applicationId application_1499949542308_0020
The JobHistory Server web UI is empty all the time.
My log settings in yarn-site.xml and mapred-site.xml:
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/home/hadoop/hadoop/nodemanager-logs</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir-suffix</name>
<value>logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://hdp03.hp.sp.prd.bmsre.com:19888/jobhistory/logs</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hdp03.hp.sp.prd.bmsre.com:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hdp03.hp.sp.prd.bmsre.com:19888</value>
</property>
On the DataNode, you can check the folder:
${HADOOP_HOME}/logs/userlogs
You just need to go to the folder that has the same name as the application id.
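For example, using the application id from the question (a sketch; the exact path depends on your install layout):
ls ${HADOOP_HOME}/logs/userlogs/application_1499949542308_0020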
Yes, you can access retired YARN jobs from the web UI.
Access this URL to get the retired jobs: http://<jobtracker>:50070
With respect to your question: since you have restarted YARN, a new log thread wakes up and uploads the logs to the configured location. But does the /app-logs path from your configuration actually exist in your filesystem? Please check.
There is a retention period for how long the logs must be stored in that path, defined by the yarn.log-aggregation.retain-seconds property.
To my understanding, the JobTracker UI, by default available at http://<jobtracker>:50070, exposes information on all currently running as well as retired MapReduce jobs, and YARN has a JobHistory REST service that exposes details on finished applications.
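Two quick checks along those lines (a sketch; the daemon script name is the Hadoop 2.x one):
# Verify the aggregation directory from yarn.nodemanager.remote-app-log-dir exists in HDFS
hdfs dfs -ls /app-logs
# Make sure the JobHistory server backing the 19888 web UI is running
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver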

The Job History web server can't connect to the Resource Manager in an HA setup

When using the YARN Resource Manager webapp, I have a problem viewing a job history page, with a URL like this:
http://node-0002.example.com:19888/jobhistory/job/job_1455545464819_0001
It shows an HTTP error with a Java exception like this:
java.lang.IllegalArgumentException: Does not contain a valid host:port authority: node-0001.example.com,node-0002.example.com:8088 (configuration property 'yarn.resourcemanager.webapp.address')
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:213)
at org.apache.hadoop.yarn.conf.YarnConfiguration.getSocketAddr(YarnConfiguration.java:1782)
at org.apache.hadoop.yarn.webapp.util.WebAppUtils.getResolvedRMWebAppURLWithoutScheme(WebAppUtils.java:164)
at org.apache.hadoop.mapreduce.v2.app.webapp.AppController.<init>(AppController.java:63)
The hostname node-0001.example.com,node-0002.example.com is obviously wrong.
I have a pretty straightforward HA-enabled configuration in my yarn-site.xml:
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>my-cluster</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>node-0001.example.com,node-0002.example.com</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.node-0001.example.com</name>
<value>node-0001.example.com</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.node-0002.example.com</name>
<value>node-0002.example.com</value>
</property>
The only unusual thing is naming the rm-ids the same way as the hosts. I also tried defining properties like yarn.resourcemanager.address and yarn.resourcemanager.webapp.address (the per-rm-id form is sketched below), but that didn't help.
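For reference, the per-rm-id variants of those properties take this form (a sketch; the suffixes must match the rm-ids defined above):
<property>
<name>yarn.resourcemanager.webapp.address.node-0001.example.com</name>
<value>node-0001.example.com:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.node-0002.example.com</name>
<value>node-0002.example.com:8088</value>
</property>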
Any idea where the comma-separated list instead of a hostname is coming from? I use Hadoop 2.7.1.

Issues in saving bulk data in HBase in Pseudo-distributed mode

I am setting up CDH4 in pseudo-distributed mode.
I have set up Hadoop and, as suggested in the CDH4 installation guide, have also completed the HDFS demo successfully.
I have also set up Hive and HBase.
To populate the data in HBase, I have written a Java client which bulk-loads the data (around 1M rows in each of 4 tables).
Now I am facing two issues:
When the Java client is running to port the dummy data into HBase, the RegionServer shuts down after around 450,000 rows have been entered in total.
Using Hive, I am not able to access the tables created in HBase, or worse, cannot even create tables from the Hive shell. The HBase shell, though, shows me the data/table structure (whatever was generated before the RegionServer shut down).
I have seen other posts regarding the same. It seems that the second issue is related to my /etc/hosts or hive-site.xml, so I am pasting the contents of both.
/etc/hosts
198.251.79.225 u17162752.onlinehome-server.com u17162752
198.251.79.225 default-domain.com
198.251.79.225 hbase.zookeeper.quorum localhost
198.251.79.225 cloudera-vm # Added by NetworkManager
127.0.0.1 localhost.localdomain localhost
127.0.1.1 cloudera-vm-local localhost
hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore</value>
<description>the URL of the MySQL database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mypassword</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://127.0.0.1:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
<property>
<name>hive.support.concurrency</name>
<description>Enable Hive's Table Lock Manager Service</description>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<description>Zookeeper quorum used by Hive's Table Lock Manager</description>
<value>zk1.myco.com,zk2.myco.com,zk3.myco.com</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<description>Zookeeper quorum used by Hive's Table Lock Manager</description>
<value>zk1.myco.com,zk2.myco.com,zk3.myco.com</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NOSASL</value>
</property>
</configuration>
These issues are keeping me from accomplishing the task I am supposed to do.
Thanks in advance,
Abhiskek
PS: This is my first post to this forum, so apologies for anything inappropriate you might have found! Thanks for bearing with me.
Hi Tariq, thanks for your response. I have somehow managed to get over this. Now I am facing another issue.
I already have 4 tables in HBase for which I want to create external tables in the Hive shell. But running the CREATE EXTERNAL TABLE commands in the Hive shell gives the following error:
'ERROR: org.apache.hadoop.hbase.client.NoServerForRegionException: No server address listed in -ROOT- for region .META.,,1.1028785192 containing row'
This error also appears when I do something in the HBase shell.
The other error that accompanies it in the HBase shell is related to ZooKeeper. Stack trace:
'WARN zookeeper.ZKUtil: catalogtracker-on-org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@6a9a56bf-0x1413718482c0010 Unable to get data of znode /hbase/unassigned/1028785192
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/unassigned/1028785192'
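For reference, a Hive-over-HBase external table statement takes this form (a sketch with placeholder table and column names):
CREATE EXTERNAL TABLE hbase_test(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "test");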
Please help. Thanks!

Tables created by an Oozie hive action cannot be found from the Hive client but do appear in HDFS

I'm trying to run a Hive script via an Oozie hive action. I just created a Hive table 'test' in my script.q, and the Oozie job ran successfully. I can find the table created by the Oozie job under the HDFS path /user/hive/warehouse, but I could not find the 'test' table via the command "show tables" in the Hive client.
I think there is something wrong with my metastore config, but I just can't figure it out.
Can somebody help?
oozie admin -oozie http://localhost:11000/oozie -status
System mode: NORMAL
oozie job -oozie http://localhost:11000/oozie -config C:\Hadoop\oozie-3.2.0-incubating\oozie-win-distro\examples\apps\hive\job.properties -run
Job ID : 0000001-130910094106919-oozie-hado-W
Run Result
Here is my oozie-site.xml:
<!--
Refer to the oozie-default.xml file for the complete list of
Oozie configuration properties and their default values.
-->
<property>
<name>oozie.service.ActionService.executor.ext.classes</name>
<value>
org.apache.oozie.action.email.EmailActionExecutor,
org.apache.oozie.action.hadoop.HiveActionExecutor,
org.apache.oozie.action.hadoop.ShellActionExecutor,
org.apache.oozie.action.hadoop.SqoopActionExecutor
</value>
</property>
<property>
<name>oozie.service.SchemaService.wf.ext.schemas</name>
<value>shell-action-0.1.xsd,email-action-0.1.xsd,hive-action-0.2.xsd,sqoop-action-0.2.xsd,ssh-action-0.1.xsd</value>
</property>
<property>
<name>oozie.system.id</name>
<value>oozie-${user.name}</value>
<description>
The Oozie system ID.
</description>
</property>
<property>
<name>oozie.systemmode</name>
<value>NORMAL</value>
<description>
System mode for Oozie at startup.
</description>
</property>
<property>
<name>oozie.service.AuthorizationService.security.enabled</name>
<value>false</value>
<description>
Specifies whether security (user name/admin role) is enabled or not.
If disabled any user can manage Oozie system and manage any job.
</description>
</property>
<property>
<name>oozie.service.PurgeService.older.than</name>
<value>30</value>
<description>
Jobs older than this value, in days, will be purged by the PurgeService.
</description>
</property>
<property>
<name>oozie.service.PurgeService.purge.interval</name>
<value>3600</value>
<description>
Interval at which the purge service will run, in seconds.
</description>
</property>
<property>
<name>oozie.service.CallableQueueService.queue.size</name>
<value>10000</value>
<description>Max callable queue size</description>
</property>
<property>
<name>oozie.service.CallableQueueService.threads</name>
<value>10</value>
<description>Number of threads used for executing callables</description>
</property>
<property>
<name>oozie.service.CallableQueueService.callable.concurrency</name>
<value>3</value>
<description>
Maximum concurrency for a given callable type.
Each command is a callable type (submit, start, run, signal, job, jobs, suspend, resume, etc).
Each action type is a callable type (Map-Reduce, Pig, SSH, FS, sub-workflow, etc).
All commands that use action executors (action-start, action-end, action-kill and action-check) use
the action type as the callable type.
</description>
</property>
<property>
<name>oozie.service.coord.normal.default.timeout</name>
<value>120</value>
<description>Default timeout for a coordinator action input check (in minutes) for normal job.
-1 means infinite timeout</description>
</property>
<property>
<name>oozie.db.schema.name</name>
<value>oozie</value>
<description>
Oozie DataBase Name
</description>
</property>
<property>
<name>oozie.service.JPAService.create.db.schema</name>
<value>true</value>
<description>
Creates Oozie DB.
If set to true, it creates the DB schema if it does not exist. If the DB schema exists, it is a NOP.
If set to false, it does not create the DB schema. If the DB schema does not exist it fails start up.
</description>
</property>
<property>
<name>oozie.service.JPAService.jdbc.driver</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>
JDBC driver class.
</description>
</property>
<property>
<name>oozie.service.JPAService.jdbc.url</name>
<value>jdbc:derby:${oozie.data.dir}/${oozie.db.schema.name}-db;create=true</value>
<description>
JDBC URL.
</description>
</property>
<property>
<name>oozie.service.JPAService.jdbc.username</name>
<value>sa</value>
<description>
DB user name.
</description>
</property>
<property>
<name>oozie.service.JPAService.jdbc.password</name>
<value>pwd</value>
<description>
DB user password.
IMPORTANT: if the password is empty, leave a 1-space string; the service trims the value,
and if it is empty, Configuration assumes it is NULL.
</description>
</property>
<property>
<name>oozie.service.JPAService.pool.max.active.conn</name>
<value>10</value>
<description>
Max number of connections.
</description>
</property>
<property>
<name>oozie.service.HadoopAccessorService.kerberos.enabled</name>
<value>false</value>
<description>
Indicates if Oozie is configured to use Kerberos.
</description>
</property>
<property>
<name>local.realm</name>
<value>LOCALHOST</value>
<description>
Kerberos Realm used by Oozie and Hadoop. Using 'local.realm' to be aligned with Hadoop configuration
</description>
</property>
<property>
<name>oozie.service.HadoopAccessorService.keytab.file</name>
<value>${user.home}/oozie.keytab</value>
<description>
Location of the Oozie user keytab file.
</description>
</property>
<property>
<name>oozie.service.HadoopAccessorService.kerberos.principal</name>
<value>${user.name}/localhost@${local.realm}</value>
<description>
Kerberos principal for Oozie service.
</description>
</property>
<property>
<name>oozie.service.HadoopAccessorService.jobTracker.whitelist</name>
<value> </value>
<description>
Whitelisted job tracker for Oozie service.
</description>
</property>
<property>
<name>oozie.service.HadoopAccessorService.nameNode.whitelist</name>
<value> </value>
<description>
Whitelisted NameNode for Oozie service.
</description>
</property>
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=hadoop-conf</value>
<description>
Comma separated AUTHORITY=HADOOP_CONF_DIR, where AUTHORITY is the HOST:PORT of
the Hadoop service (JobTracker, HDFS). The wildcard '*' configuration is
used when there is no exact match for an authority. The HADOOP_CONF_DIR contains
the relevant Hadoop *-site.xml files. If the path is relative, it is looked up within
the Oozie configuration directory; the path can also be absolute (i.e. pointing
to Hadoop client conf/ directories in the local filesystem).
</description>
</property>
<property>
<name>oozie.service.WorkflowAppService.system.libpath</name>
<value>/user/${user.name}/share/lib</value>
<description>
System library path to use for workflow applications.
This path is added to workflow applications if their job properties set
the property 'oozie.use.system.libpath' to true.
</description>
</property>
<property>
<name>use.system.libpath.for.mapreduce.and.pig.jobs</name>
<value>false</value>
<description>
If set to true, submissions of MapReduce and Pig jobs will automatically
include the system library path, thus not requiring users to
specify where the Pig JAR files are. Instead, the ones from the system
library path are used.
</description>
</property>
<property>
<name>oozie.authentication.type</name>
<value>simple</value>
<description>
Defines authentication used for Oozie HTTP endpoint.
Supported values are: simple | kerberos | #AUTHENTICATION_HANDLER_CLASSNAME#
</description>
</property>
<property>
<name>oozie.authentication.token.validity</name>
<value>36000</value>
<description>
Indicates how long (in seconds) an authentication token is valid before it has
to be renewed.
</description>
</property>
<property>
<name>oozie.authentication.signature.secret</name>
<value>oozie</value>
<description>
The signature secret for signing the authentication tokens.
If not set a random secret is generated at startup time.
In order for authentication to work correctly across multiple hosts
the secret must be the same across all the hosts.
</description>
</property>
<property>
<name>oozie.authentication.cookie.domain</name>
<value></value>
<description>
The domain to use for the HTTP cookie that stores the authentication token.
In order for authentication to work correctly across multiple hosts
the domain must be correctly set.
</description>
</property>
<property>
<name>oozie.authentication.simple.anonymous.allowed</name>
<value>true</value>
<description>
Indicates if anonymous requests are allowed.
This setting is meaningful only when using 'simple' authentication.
</description>
</property>
<property>
<name>oozie.authentication.kerberos.principal</name>
<value>HTTP/localhost@${local.realm}</value>
<description>
Indicates the Kerberos principal to be used for HTTP endpoint.
The principal MUST start with 'HTTP/' as per Kerberos HTTP SPNEGO specification.
</description>
</property>
<property>
<name>oozie.authentication.kerberos.keytab</name>
<value>${oozie.service.HadoopAccessorService.keytab.file}</value>
<description>
Location of the keytab file with the credentials for the principal.
Referring to the same keytab file Oozie uses for its Kerberos credentials for Hadoop.
</description>
</property>
<property>
<name>oozie.authentication.kerberos.name.rules</name>
<value>DEFAULT</value>
<description>
The Kerberos name rules are used to resolve Kerberos principal names; refer to Hadoop's
KerberosName for more details.
</description>
</property>
<!-- Proxyuser Configuration -->
<!--
<property>
<name>oozie.service.ProxyUserService.proxyuser.#USER#.hosts</name>
<value>*</value>
<description>
List of hosts the '#USER#' user is allowed to perform 'doAs'
operations.
The '#USER#' must be replaced with the username of the user who is
allowed to perform 'doAs' operations.
The value can be the '*' wildcard or a list of hostnames.
For multiple users copy this property and replace the user name
in the property name.
</description>
</property>
<property>
<name>oozie.service.ProxyUserService.proxyuser.#USER#.groups</name>
<value>*</value>
<description>
List of groups the '#USER#' user is allowed to impersonate users
from to perform 'doAs' operations.
The '#USER#' must be replaced with the username of the user who is
allowed to perform 'doAs' operations.
The value can be the '*' wildcard or a list of groups.
For multiple users copy this property and replace the user name
in the property name.
</description>
</property>
-->
Here is my hive-site.xml:
[hive-site.xml]
Here is my script.q:
create table test(id int);
Inside your Oozie hive action you need to tell Oozie where your Hive metastore is.
That means you need to pass your hive-site.xml as an argument.
You also need to configure an external metastore for Hive for this to work; the default embedded Derby configuration will not work for you.
So, in simple steps:
1. Create Hive settings with an external database, say MySQL.
2. Pass that hive-site.xml to the Oozie action (a sketch of such an action is below).
See here for details:
http://oozie.apache.org/docs/3.3.1/DG_HiveActionExtension.html
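A minimal hive action of that shape could look like this (a sketch against the hive-action 0.2 schema listed in oozie.service.SchemaService.wf.ext.schemas above; node names and paths are placeholders):
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>hive-site.xml</job-xml>
<script>script.q</script>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>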
Thanks
