AMQ Error reading in simpleString, length=xxx is greater than readableBytes=yyy

I'm trying to work out how to fix this ActiveMQ Artemis error.
It seems the occasional message is too big for SimpleString; it fails to deliver and goes to the DLQ.
java.lang.IndexOutOfBoundsException: Error reading in simpleString, length=1366648 is greater than readableBytes=127646#ClientLargeMessageImpl[messageID=578576793, durable=true, address=AuthCorrespondence.sendmail,userID=7f72137c-c3a3-11eb-87f7-0242c0a8e003,properties=TypedProperties[__AMQ_CID=b3f70eb1-be3c-11eb-87f7-0242c0a8e003,_AMQ_LARGE_SIZE=127651,_AMQ_ROUTING_TYPE=1]]
at org.apache.activemq.artemis.jms.client.ActiveMQMessageConsumer.getMessage(ActiveMQMessageConsumer.java:234)
at org.apache.activemq.artemis.jms.client.ActiveMQMessageConsumer.receive(ActiveMQMessageConsumer.java:132)
at org.springframework.jms.support.destination.JmsDestinationAccessor.receiveFromConsumer(JmsDestinationAccessor.java:130)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveMessage(AbstractPollingMessageListenerContainer.java:416)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:302)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:255)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1168)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1160)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1057)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IndexOutOfBoundsException: Error reading in simpleString, length=1366648 is greater than readableBytes=127646
at org.apache.activemq.artemis.api.core.SimpleString.readSimpleString(SimpleString.java:183)
at org.apache.activemq.artemis.api.core.SimpleString.readSimpleString(SimpleString.java:171)
at org.apache.activemq.artemis.api.core.SimpleString.readNullableSimpleString(SimpleString.java:158)
at org.apache.activemq.artemis.core.buffers.impl.ChannelBufferWrapper.readNullableSimpleString(ChannelBufferWrapper.java:69)
at org.apache.activemq.artemis.reader.TextMessageUtil.readBodyText(TextMessageUtil.java:37)
at org.apache.activemq.artemis.jms.client.ActiveMQTextMessage.doBeforeReceive(ActiveMQTextMessage.java:112)
at org.apache.activemq.artemis.jms.client.ActiveMQMessageConsumer.getMessage(ActiveMQMessageConsumer.java:228)
... 11 more
The most likely clue I can see is the similarity between readableBytes=127646 and _AMQ_LARGE_SIZE=127651.
From the docs, though, _AMQ_LARGE_SIZE is the threshold for large messages, which is supposed to be 2GB, and this message is what, 1.36MB?
What's going on?
EDIT:
[root@6dcbad102045 large-messages]# pwd
/opt/amq/broker/data/large-messages
[root@6dcbad102045 large-messages]# ls -l
total 13828
-rw-r--r-- 1 root root 6451200 Sep 14 2020 194154444.msg
-rw-r--r-- 1 root root 4198400 Nov 5 2020 266358970.msg
-rw-r--r-- 1 root root 1843200 Nov 13 2020 277265384.msg
-rw-r--r-- 1 root root 1433600 Apr 28 12:36 522483226.msg
-rw-r--r-- 1 root root 102400 Jun 2 15:07 578576791.msg
-rw-r--r-- 1 root root 127651 Jun 3 09:46 579961682.msg
I'm on Fuse/OSGi: 2.6.3.redhat-00015 for the ActiveMQ Artemis JMS Client OSGi bundle and 2.21.5 for camel-amqp. I can't work out which Artemis broker version it is. There are 1000+ successful deliveries and just 6 failures.

The 2.6.3.redhat-00015 version corresponds to AMQ 7.2.3, which is quite old at this point. The current AMQ release is 7.8.1. I strongly recommend you upgrade, as it's likely you're hitting a bug that's already been fixed.
You may be able to work around the issue by increasing the minimum large message size (e.g. using minLargeMessageSize on core client URLs or amqpMinLargeMessageSize on your AMQP acceptor). For what it's worth, the stack trace indicates that the core JMS client (i.e. not AMQP) is in use when the exception is thrown.
Lastly, it's worth noting that the default minimum large message size is 100 KB, not 2 GB, as explained in the documentation.
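To illustrate the client-side knob, here's a minimal sketch assuming the core JMS client shown in the stack trace; broker-host and the 256 KB value are placeholders, not values from the question:

import javax.jms.Connection;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class LargeMessageThresholdSketch {
    public static void main(String[] args) throws Exception {
        // minLargeMessageSize is in bytes: messages below this size are sent as
        // regular messages instead of being spooled as large messages.
        // 262144 (256 KB) is an arbitrary example, not a recommended value.
        ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory(
                "tcp://broker-host:61616?minLargeMessageSize=262144");
        try (Connection connection = cf.createConnection()) {
            connection.start();
            // ... create a session/producer as usual; messages under the
            // threshold will no longer be sent as large messages.
        }
    }
}

If an AMQP producer were involved instead, the analogous setting would go on the broker's AMQP acceptor in broker.xml (amqpMinLargeMessageSize appended to the acceptor URL).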

Related

Quarkus with Azure Text to Speech issue - Cognitive Services

I'm using Microsoft Cognitive Services within a Quarkus application. Everything works fine locally... including in a local Docker environment.
However, when I deploy this thing to AKS... it logs the following error:
f506c46e-0234-4dc6-a3c8-27d6949cd4b1-1: org.jboss.resteasy.spi.UnhandledException: java.lang.UnsatisfiedLinkError: 'void com.microsoft.cognitiveservices.speech.SpeechConfig.setTempDirectory(java.lang.String)'
at org.jboss.resteasy.core.ExceptionHandler.handleApplicationException(ExceptionHandler.java:106)
at org.jboss.resteasy.core.ExceptionHandler.handleException(ExceptionHandler.java:372)
at org.jboss.resteasy.core.SynchronousDispatcher.writeException(SynchronousDispatcher.java:218)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:519)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:261)
at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:161)
at org.jboss.resteasy.core.interception.jaxrs.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:364)
at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:164)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:247)
at io.quarkus.resteasy.runtime.standalone.RequestDispatcher.service(RequestDispatcher.java:73)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler.dispatch(VertxRequestHandler.java:138)
at io.quarkus.resteasy.runtime.standalone.VertxRequestHandler$1.run(VertxRequestHandler.java:93)
at io.quarkus.vertx.core.runtime.VertxCoreRecorder$13.runWith(VertxCoreRecorder.java:503)
at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2442)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1476)
at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.UnsatisfiedLinkError: 'void com.microsoft.cognitiveservices.speech.SpeechConfig.setTempDirectory(java.lang.String)'
at com.microsoft.cognitiveservices.speech.SpeechConfig.setTempDirectory(Native Method)
at com.microsoft.cognitiveservices.speech.SpeechConfig.<clinit>(SpeechConfig.java:77)
I have checked the content of the tmp directory within the pod... and the files are created there... could this be due to the standard OS image that forms the base of the Quarkus container image?
For reference:
FROM registry.access.redhat.com/ubi8/ubi-minimal:8.3
ARG JAVA_PACKAGE=java-11-openjdk-headless
Content of the pod's /tmp:
/work # ls -ltr /tmp
total 84
drwxr-xr-x 2 root root 4096 Aug 3 12:10 hsperfdata_root
drwxrwxrwx 3 root root 4096 Aug 3 12:10 vertx-cache
-rw-r--r-- 1 root root 3839 Aug 3 12:20 m4j3821112471677595333.tmp
-rw-r--r-- 1 root root 68372 Aug 3 12:20 m4j3337239382403748465.tmp
drwx------ 2 root root 4096 Aug 3 12:20 speech-sdk-native-15500065335689334576
Azure dependency:
<dependency>
<groupId>com.microsoft.cognitiveservices.speech</groupId>
<artifactId>client-sdk</artifactId>
<version>1.18.0</version>
</dependency>
Not sure what is causing this error. Has anyone experienced this behaviour before?
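One hedged diagnostic (an assumption to verify, not a confirmed fix): an UnsatisfiedLinkError on a native method usually means the JNI shared library failed to load, which on minimal base images is often down to a missing shared-object dependency. From inside the pod, something like the following would show whether anything is unresolved (the directory name is taken from the /tmp listing above; the exact .so file name is an assumption):

ldd /tmp/speech-sdk-native-15500065335689334576/libMicrosoft.CognitiveServices.Speech.java.bindings.so
# any line reported as "not found" names a library absent from ubi-minimal;
# a hypothetical Dockerfile fix, if e.g. ALSA or OpenSSL turned out to be missing:
#   RUN microdnf install -y alsa-lib openssl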

Sybase 12.5 vs 15.0 client connect libraries: 10x slower insert using 15.0 when inserting into 15.7 ASE

I maintain some legacy code that runs on RH Linux that sends inserts over the network to a client's Sybase. We were using Sybase 12.5 libraries and have just migrated to use Sybase 15.0 client libraries.
My application logs the time at which it sends the insert over the network and also the time it gets the acknowledgment back from the target Sybase. When using the 12.5 libraries the time was ~5 ms; now with the 15.0 libraries it's roughly 50 ms.
The only change I've made on the application side is to specify the location of the interfaces file on the command line. Previously the file was located in the default location - the location of the Sybase installation. Now it's located where the application is deployed, hence the need to specify the location explicitly.
Would anyone have any idea what is causing the dramatic change in speed, or have hints on where I could look or ideas on how to trace the root cause?
Please forgive the lack of technical details. I'm not a DB admin but a developer using a compiled library to connect to Sybase, and I don't have access to the nitty-gritty internals. That being said, I'm using the same internal library in both cases; it's only the Sybase libraries that are different.
My Sybase 12.5 and 15 installations look like this:
$ ls -l /opt/sybase/
total 48
-rw-r--r-- 1 root root 555 Jul 2 2019 ASE150.csh
-rw-r--r-- 1 root root 259 Jul 2 2019 ASE150.env
-rw-r--r-- 1 root root 388 Jul 2 2019 ASE150.sh
drwxr-xr-x 10 root root 4096 Feb 2 2017 OCS-15_0
-rw-r--r-- 1 root root 555 Jul 2 2019 SYBASE.csh
-rw-r--r-- 1 root root 259 Jul 2 2019 SYBASE.env
-rw-r--r-- 1 root root 388 Jul 2 2019 SYBASE.sh
drwxr-xr-x 58 root root 4096 Jul 2 2019 charsets
drwxr-xr-x 3 root root 4096 Jul 2 2019 collate
drwxr-xr-x 2 root root 4096 Nov 23 20:55 config
-rw-r--r-- 1 root root 1239 Jul 2 2019 interfaces
drwxr-xr-x 5 root root 4096 Nov 23 20:55 locales
$ ls -l ~/12_5/sybase/
total 28
drwxrwxr-x 4 oadc oadc 4096 Nov 29 2017 OCS-12_5
drwxrwxr-x 58 oadc oadc 4096 Nov 29 2017 charsets
drwxrwxr-x 2 oadc oadc 4096 Mar 16 09:45 config
drwxrwxr-x 2 oadc oadc 4096 Mar 16 09:45 include
-r-xr-xr-x 1 oadc oadc 1184 Mar 16 09:45 interfaces
drwxrwxr-x 2 oadc oadc 4096 Mar 16 09:45 lib
drwxrwxr-x 5 oadc oadc 4096 Mar 16 09:45 locales
EDIT
After some more digging, it looks like the libraries under OCS-12_5 are not actually 12.5 but 15.5!
$ strings sybase/OCS-12_5/lib/libsybct*.a | grep "Sybase Client-Library"
Sybase Client-Library/15.5/P/DRV.15.5.0/Linux x86_64/Linux 2.6.9-55.ELsmp x86_64/BUILD1550-003/64bit/OPT/Mon Oct 5 23:16:48 2009
Sybase Client-Library/15.5/P/DRV.15.5.0/Linux x86_64/Linux 2.6.9-55.ELsmp x86_64 Native Threads/BUILD1550-003/64bit/OPT/Tue Oct 6 00:06:57 2009
Which means my assumption that 12.5 was faster than 15.0 is wrong. What is actually happening is that 15.5 is faster than 15.0, which makes more sense.
I'm not going to go hunt down the idiot that submitted these files into a directory labelled OCS-12_5 ...
I've updated the question with this new information.

Rabbitmq /usr/local/etc/rabbitmq/rabbitmq-env.conf Missing

I just installed RabbitMQ on an AWS EC2-Instance (CentOS) using the following,
sudo yum install erlang
sudo yum install rabbitmq-server
I was then able to successfully turn it on using,
sudo chkconfig rabbitmq-server on
sudo /sbin/service rabbitmq-server start
...and
sudo /sbin/service rabbitmq-server stop
sudo rabbitmq-server # run in foreground
But now I'm trying to modify the /usr/local/etc/rabbitmq/rabbitmq-env.conf file so I can change the NODE_IP_ADDRESS, but the file is nowhere to be found.
No rabbitmq folder under,
[ec2-user@ip-0-0-0-0 sbin]$ ls /usr/local/etc
[ec2-user@ip-0-0-0-0 sbin]$
There's a rabbitmq folder under /etc but there's nothing in it,
[ec2-user@ip-0-0-0-0 rabbitmq]$ pwd
/etc/rabbitmq
[ec2-user@ip-0-0-0-0 rabbitmq]$ ls
[ec2-user@ip-0-0-0-0 rabbitmq]$
And the only thing in my environment variables for rabbitmq is this
[ec2-user@ip-0-0-0-0 rabbitmq]$ printenv | grep rabbit
PWD=/etc/rabbitmq
I was able to go to the location of the rabbitmq logs and find this information,
root@ip-0-0-0-0
[/var/log/rabbitmq]# pwd
/var/log/rabbitmq
root@ip-0-0-0-0
[/var/log/rabbitmq]# ls -al
total 20
drwxr-x--- 2 rabbitmq rabbitmq 4096 Jun 7 17:28 .
drwxr-xr-x 10 root root 4096 Jun 7 17:23 ..
-rw-r--r-- 1 rabbitmq rabbitmq 3638 Jun 7 17:33 rabbit@ip-0-0-0-0.log
-rw-r--r-- 1 rabbitmq rabbitmq 0 Jun 7 17:25 rabbit@ip-0-0-0-0-sasl.log
-rw-r--r-- 1 root root 0 Jun 7 17:28 shutdown_err
-rw-r--r-- 1 root root 65 Jun 7 17:28 shutdown_log
-rw-r--r-- 1 root root 0 Jun 7 17:25 startup_err
-rw-r--r-- 1 root root 385 Jun 7 17:28 startup_log
cat rabbit@ip-0-0-0-0.log
=INFO REPORT==== 7-Jun-2018::17:29:01 ===
node : rabbit@ip-0-0-0-0
home dir : /var/lib/rabbitmq
config file(s) : (none)
cookie hash : W/uaA12+PF+KOIbCmdKTkw==
log : /var/log/rabbitmq/rabbit@ip-0-0-0-0.log
sasl log : /var/log/rabbitmq/rabbit@ip-0-0-0-0-sasl.log
database dir : /var/lib/rabbitmq/mnesia/rabbit@ip-0-0-0-0
And /var/lib/rabbitmq contains this,
[/var/lib/rabbitmq/mnesia]# cd /var/lib/rabbitmq/
root@ip-0-0-0-0
[/var/lib/rabbitmq]# ls
mnesia
And
[/var/lib/rabbitmq/mnesia]# pwd
/var/lib/rabbitmq/mnesia
root@ip-0-0-0-0
[/var/lib/rabbitmq/mnesia]# ls -al
total 20
drwxr-xr-x 4 rabbitmq rabbitmq 4096 Jun 7 17:29 .
drwxr-x--- 3 rabbitmq rabbitmq 4096 Jun 7 17:25 ..
drwxr-xr-x 4 rabbitmq rabbitmq 4096 Jun 7 17:35 rabbit@ip-0-0-0-0
-rw-r--r-- 1 rabbitmq rabbitmq 5 Jun 7 17:28 rabbit@ip-0-0-0-0.pid
drwxr-xr-x 2 rabbitmq rabbitmq 4096 Jun 7 17:29 rabbit@ip-0-0-0-0-plugins-expand
root@ip-0-0-0-0
And,
[/var/lib/rabbitmq/mnesia/rabbit@ip-0-0-0-0]# pwd
/var/lib/rabbitmq/mnesia/rabbit@ip-0-0-0-0
root@ip-0-0-0-0
[/var/lib/rabbitmq/mnesia/rabbit@ip-0-0-0-0]# ls -al
total 100
drwxr-xr-x 4 rabbitmq rabbitmq 4096 Jun 7 17:35 .
drwxr-xr-x 4 rabbitmq rabbitmq 4096 Jun 7 17:29 ..
-rw-r--r-- 1 rabbitmq rabbitmq 59 Jun 7 17:29 cluster_nodes.config
-rw-r--r-- 1 rabbitmq rabbitmq 160 Jun 7 17:35 DECISION_TAB.LOG
-rw-r--r-- 1 rabbitmq rabbitmq 99 Jun 7 17:35 LATEST.LOG
drwxr-xr-x 2 rabbitmq rabbitmq 4096 Jun 7 17:29 msg_store_persistent
drwxr-xr-x 2 rabbitmq rabbitmq 4096 Jun 7 17:29 msg_store_transient
-rw-r--r-- 1 rabbitmq rabbitmq 29 Jun 7 17:29 nodes_running_at_shutdown
-rw-r--r-- 1 rabbitmq rabbitmq 1123 Jun 7 17:29 rabbit_durable_exchange.DCD
-rw-r--r-- 1 rabbitmq rabbitmq 2422 Jun 7 17:32 rabbit_durable_exchange.DCL
-rw-r--r-- 1 rabbitmq rabbitmq 8 Jun 7 17:25 rabbit_durable_queue.DCD
-rw-r--r-- 1 rabbitmq rabbitmq 8 Jun 7 17:25 rabbit_durable_route.DCD
-rw-r--r-- 1 rabbitmq rabbitmq 8 Jun 7 17:25 rabbit_runtime_parameters.DCD
-rw-r--r-- 1 rabbitmq rabbitmq 3 Jun 7 17:29 rabbit_serial
-rw-r--r-- 1 rabbitmq rabbitmq 344 Jun 7 17:35 rabbit_user.DCD
-rw-r--r-- 1 rabbitmq rabbitmq 193 Jun 7 17:29 rabbit_user_permission.DCD
-rw-r--r-- 1 rabbitmq rabbitmq 461 Jun 7 17:35 rabbit_user_permission.DCL
-rw-r--r-- 1 rabbitmq rabbitmq 134 Jun 7 17:29 rabbit_vhost.DCD
-rw-r--r-- 1 rabbitmq rabbitmq 289 Jun 7 17:32 rabbit_vhost.DCL
-rw-r--r-- 1 rabbitmq rabbitmq 19108 Jun 7 17:25 schema.DAT
-rw-r--r-- 1 rabbitmq rabbitmq 233 Jun 7 17:25 schema_version
And last but not least apparently the logs say there isn't a config file,
[/var/log/rabbitmq]# cat rabbit@ip-0-0-0-0.log | grep config
config file(s) : (none)
config file(s) : (none)
RabbitMQ Version: {rabbit,"RabbitMQ","3.1.5"}
Does anyone know what's going on here? I'm surprised I didn't see any errors when I started the rabbitmq-server. Do I just create the config files myself?
UPDATE:
I was setting up a cluster environment for my Apache Airflow, configuring it with the CeleryExecutor and RabbitMQ as the queue. It turns out I'm running my EC2 instance on Amazon Linux 1, which doesn't include systemd, so I wasn't able to get RabbitMQ properly installed. Had I built the server on Amazon Linux 2, Ubuntu, or any other Linux that doesn't suck, I could potentially have gotten further with installing RabbitMQ and getting it to work with Airflow.

So I went on to AWS SQS for my queue, and then I ran into this error. By that point I had wasted over two and a half days just trying to get a queue to work with Celery and Airflow, and then I read this article, which says that Airbnb (the creators of Airflow) use Celery with Redis as their queue. So I tried it out, and it literally took me three minutes and is working flawlessly.

All I did was install Redis with sudo yum install redis and bam, I had Redis installed. I started Redis with redis-server, changed the broker_url field in my airflow.cfg to broker_url = redis://, ran airflow initdb, restarted the scheduler with airflow scheduler, then started a worker with airflow worker, and BAM, my DAGs started running on the Redis queue with the CeleryExecutor. HALLELUJAH, just use Redis as your queue...
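For anyone following the same path, a condensed sketch of those steps (commands exactly as described above; the bare redis:// broker URL defaults to localhost):

sudo yum install redis      # install Redis
redis-server                # start Redis
# in airflow.cfg:
#   broker_url = redis://
airflow initdb              # re-initialise the Airflow metadata database
airflow scheduler           # restart the scheduler
airflow worker              # start a Celery worker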
The RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
You should be using the latest version of RabbitMQ (3.7.5) and Erlang 19.3 or later. Version 3.1.5 is very, very, very old. Please see this document for instructions on how to install a recent RMQ on an rpm-based distro.
After that, you will create rabbitmq-env.conf yourself.
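A minimal sketch of that file, assuming the NODE_IP_ADDRESS change from the question (inside rabbitmq-env.conf the variables are written without the RABBITMQ_ prefix; 0.0.0.0 is an example bind address):

# /etc/rabbitmq/rabbitmq-env.conf
NODE_IP_ADDRESS=0.0.0.0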

Elasticsearch crashes with AccessDeniedException when providing new data.path

I updated the path.data entry in elasticsearch.yml to point at a new location. However, upon trying to restart Elasticsearch, I now get the following error:
Starting elasticsearch: Exception in thread "main" java.lang.IllegalStateException: Unable to access 'path.data' (/etc/lib/stuff)
Likely root cause: java.nio.file.AccessDeniedException: /etc/lib
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
at java.nio.file.Files.createDirectory(Files.java:674)
at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
at java.nio.file.Files.createDirectories(Files.java:767)
at org.elasticsearch.bootstrap.Security.ensureDirectoryExists(Security.java:250)
at org.elasticsearch.bootstrap.Security.addPath(Security.java:227)
at org.elasticsearch.bootstrap.Security.addFilePermissions(Security.java:203)
at org.elasticsearch.bootstrap.Security.createPermissions(Security.java:184)
at org.elasticsearch.bootstrap.Security.configure(Security.java:105)
at org.elasticsearch.bootstrap.Bootstrap.setupSecurity(Bootstrap.java:196)
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:167)
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:285)
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:35)
For the sake of debugging, I'm trying to move path.data from the default of /var/lib/elasticsearch to /var/lib/stuff.
As far as I can tell, the user, group, and permissions match across these two directories:
$ ls -l /var/lib/
drwxr-xr-x 3 elasticsearch elasticsearch 4096 Jan 11 10:50 elasticsearch
...
drwxr-xr-x 2 elasticsearch elasticsearch 4096 Jan 13 10:54 stuff
...
So why does Elasticsearch throw an AccessDeniedException?
If I remove my custom path.data entry, it goes back to working without issue. Am I missing something?
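For reference, a minimal sketch of the relevant elasticsearch.yml entry (the setting name path.data matches the error message; the directory and ownership are whatever you created, as in the ls output above):

# elasticsearch.yml
path.data: /var/lib/stuff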

Namenode HA (UnknownHostException: nameservice1)

We enabled NameNode High Availability through Cloudera Manager, using
Cloudera Manager >> HDFS >> Actions >> Enable High Availability >> selected the standby NameNode & JournalNodes,
then named the nameservice nameservice1.
Once the whole process completed, we deployed the client configuration.
We tested from a client machine by listing HDFS directories (hadoop fs -ls /), then manually failing over to the standby NameNode and listing the HDFS directories again (hadoop fs -ls /). This test worked perfectly.
But when I ran a Hadoop sleep job using the following command, it failed:
$ hadoop jar /opt/cloudera/parcels/CDH-4.6.0-1.cdh4.6.0.p0.26/lib/hadoop-0.20-mapreduce/hadoop-examples.jar sleep -m 1 -r 0
java.lang.IllegalArgumentException: java.net.UnknownHostException: nameservice1
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:448)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:410)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:128)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2308)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:87)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2342)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2324)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:351)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:103)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:980)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:974)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:974)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:948)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1410)
at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:174)
at org.apache.hadoop.examples.SleepJob.run(SleepJob.java:237)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.examples.SleepJob.main(SleepJob.java:165)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.net.UnknownHostException: nameservice1
... 37 more
I don't know why it's not able to resolve nameservice1 even after deploying the client configuration.
When I googled this issue, I found only one solution:
add the entries below to the configuration to fix the issue:
dfs.nameservices=nameservice1
dfs.ha.namenodes.nameservice1=namenode1,namenode2
dfs.namenode.rpc-address.nameservice1.namenode1=ip-10-118-137-215.ec2.internal:8020
dfs.namenode.rpc-address.nameservice1.namenode2=ip-10-12-122-210.ec2.internal:8020
dfs.client.failover.proxy.provider.nameservice1=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
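(In hdfs-site.xml each of these becomes a <property> entry; for illustration, a sketch of the first two, with values exactly as listed above:)

<property>
  <name>dfs.nameservices</name>
  <value>nameservice1</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameservice1</name>
  <value>namenode1,namenode2</value>
</property>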
My impression was that Cloudera Manager takes care of this. I checked the client for this configuration, and it was there (/var/run/cloudera-scm-agent/process/1998-deploy-client-config/hadoop-conf/hdfs-site.xml).
Some more details of the config files:
[11:22:37 root@datasci01.dev:~]# ls -l /etc/hadoop/conf.cloudera.*
/etc/hadoop/conf.cloudera.hdfs:
total 16
-rw-r--r-- 1 root root 943 Jul 31 09:33 core-site.xml
-rw-r--r-- 1 root root 2546 Jul 31 09:33 hadoop-env.sh
-rw-r--r-- 1 root root 1577 Jul 31 09:33 hdfs-site.xml
-rw-r--r-- 1 root root 314 Jul 31 09:33 log4j.properties
/etc/hadoop/conf.cloudera.hdfs1:
total 20
-rwxr-xr-x 1 root root 233 Sep 5 2013 container-executor.cfg
-rw-r--r-- 1 root root 1890 May 21 15:48 core-site.xml
-rw-r--r-- 1 root root 2546 May 21 15:48 hadoop-env.sh
-rw-r--r-- 1 root root 1577 May 21 15:48 hdfs-site.xml
-rw-r--r-- 1 root root 314 May 21 15:48 log4j.properties
/etc/hadoop/conf.cloudera.mapreduce:
total 20
-rw-r--r-- 1 root root 1032 Jul 31 09:33 core-site.xml
-rw-r--r-- 1 root root 2775 Jul 31 09:33 hadoop-env.sh
-rw-r--r-- 1 root root 1450 Jul 31 09:33 hdfs-site.xml
-rw-r--r-- 1 root root 314 Jul 31 09:33 log4j.properties
-rw-r--r-- 1 root root 2446 Jul 31 09:33 mapred-site.xml
/etc/hadoop/conf.cloudera.mapreduce1:
total 24
-rwxr-xr-x 1 root root 233 Sep 5 2013 container-executor.cfg
-rw-r--r-- 1 root root 1979 May 16 12:20 core-site.xml
-rw-r--r-- 1 root root 2775 May 16 12:20 hadoop-env.sh
-rw-r--r-- 1 root root 1450 May 16 12:20 hdfs-site.xml
-rw-r--r-- 1 root root 314 May 16 12:20 log4j.properties
-rw-r--r-- 1 root root 2446 May 16 12:20 mapred-site.xml
[11:23:12 root@datasci01.dev:~]#
I suspect it's an issue with the old configuration in /etc/hadoop/conf.cloudera.hdfs1 and /etc/hadoop/conf.cloudera.mapreduce1, but I'm not sure.
It looks like /etc/hadoop/conf/* never got updated:
# ls -l /etc/hadoop/conf/
total 24
-rwxr-xr-x 1 root root 233 Sep 5 2013 container-executor.cfg
-rw-r--r-- 1 root root 1979 May 16 12:20 core-site.xml
-rw-r--r-- 1 root root 2775 May 16 12:20 hadoop-env.sh
-rw-r--r-- 1 root root 1450 May 16 12:20 hdfs-site.xml
-rw-r--r-- 1 root root 314 May 16 12:20 log4j.properties
-rw-r--r-- 1 root root 2446 May 16 12:20 mapred-site.xml
Does anyone have any idea about this issue?
It looks like you are using the wrong client configuration in the /etc/hadoop/conf directory. Sometimes Cloudera Manager's (CM) "deploy client configuration" option may not work.
As you have enabled NN HA, you should have valid core-site.xml and hdfs-site.xml files in your Hadoop client configuration directory. To get the valid site files, go to the HDFS service in CM and choose the "Download client configuration" option from the Actions button. You will get the configuration files in zip format; extract the zip and replace /etc/hadoop/conf/core-site.xml and /etc/hadoop/conf/hdfs-site.xml with the extracted files.
Got it resolved. The wrong config was linked: "/etc/hadoop/conf/" --> "/etc/alternatives/hadoop-conf/" --> "/etc/hadoop/conf.cloudera.mapreduce1".
It has to be "/etc/hadoop/conf/" --> "/etc/alternatives/hadoop-conf/" --> "/etc/hadoop/conf.cloudera.mapreduce".
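A hedged sketch of repointing that symlink chain (the alternative name hadoop-conf is inferred from the /etc/alternatives/hadoop-conf path above; verify it with --display before changing anything):

alternatives --display hadoop-conf                                        # inspect the current link
sudo alternatives --set hadoop-conf /etc/hadoop/conf.cloudera.mapreduce   # repoint it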
The statement below in my code resolved the problem by specifying the host and port:
val dfs = sqlContext.read.json("hdfs://localhost:9000/user/arvindd/input/employee.json")
I resolved this issue by putting the complete line to create the RDD:
myfirstrdd = sc.textFile("hdfs://192.168.35.132:8020/BUPA.txt")
and then I was able to do other RDD transformations. Make sure you have read/write/execute permissions on the file, or do a chmod 777.
