Ambari dashboard retrieving no statistics

I have a fresh install of Hortonworks Data Platform 2.2 on a small cluster (4 machines), but when I log in to the Ambari GUI, the majority of the dashboard stats boxes (HDFS disk usage, network usage, memory usage, etc.) are not populated with any statistics; instead they show the message:
No data. There was no data available. Possible reasons include inaccessible Ganglia service.
Clicking on the HDFS service link gives the following summary:
NameNode Started
SNameNode Started
DataNodes 4/4 DataNodes Live
NameNode Uptime Not Running
NameNode Heap n/a / n/a (0.0% used)
DataNodes Status 4 live / 0 dead / 0 decommissioning
Disk Usage (DFS Used) n/a / n/a (0%)
Disk Usage (Non DFS Used) n/a / n/a (0%)
Disk Usage (Remaining) n/a / n/a (0%)
Blocks (total) n/a
Block Errors n/a corrupt / n/a missing / n/a under replicated
Total Files + Directories n/a
Upgrade Status Upgrade not finalized
Safe Mode Status n/a
The Alerts and Health Checks box on the right of the screen does not display any information, but clicking the settings icon opens the Nagios frontend, and everything looks healthy there too!
The install went smoothly (CentOS 6.5) and everything looks good as far as the services are concerned (all started, with a green tick next to each service name). Some stats are displayed on the dashboard: 4/4 DataNodes are live, 1/1 NodeManagers live and 1/1 Supervisors live. I can write files to HDFS, so it looks like a Ganglia issue?
The Ganglia daemon seems to be working ok:
ps -ef | grep gmond
nobody 1720 1 0 12:54 ? 00:00:44 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPHistoryServer/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPHistoryServer/gmond.pid
nobody 1753 1 0 12:54 ? 00:00:44 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPFlumeServer/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPFlumeServer/gmond.pid
nobody 1790 1 0 12:54 ? 00:00:48 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPHBaseMaster/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPHBaseMaster/gmond.pid
nobody 1821 1 1 12:54 ? 00:00:57 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPKafka/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPKafka/gmond.pid
nobody 1850 1 0 12:54 ? 00:00:44 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPSupervisor/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPSupervisor/gmond.pid
nobody 1879 1 0 12:54 ? 00:00:45 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPSlaves/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPSlaves/gmond.pid
nobody 1909 1 0 12:54 ? 00:00:48 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPResourceManager/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPResourceManager/gmond.pid
nobody 1938 1 0 12:54 ? 00:00:50 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPNameNode/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPNameNode/gmond.pid
nobody 1967 1 0 12:54 ? 00:00:47 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPNodeManager/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPNodeManager/gmond.pid
nobody 1996 1 0 12:54 ? 00:00:44 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPNimbus/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPNimbus/gmond.pid
nobody 2028 1 1 12:54 ? 00:00:58 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPDataNode/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPDataNode/gmond.pid
nobody 2057 1 0 12:54 ? 00:00:51 /usr/sbin/gmond --conf=/etc/ganglia/hdp/HDPHBaseRegionServer/gmond.core.conf --pid-file=/var/run/ganglia/hdp/HDPHBaseRegionServer/gmond.pid
I have checked the Ganglia service on each node; the processes are running as expected:
ps -ef | grep gmetad
nobody 2807 1 2 12:55 ? 00:01:59 /usr/sbin/gmetad --conf=/etc/ganglia/hdp/gmetad.conf --pid-file=/var/run/ganglia/hdp/gmetad.pid
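One extra check worth doing (assuming gmetad's default XML port of 8651; adjust if yours differs) is to confirm that gmetad is actually serving cluster XML:
nc localhost 8651 | head
This should print an XML document starting with a GANGLIA_XML element and containing per-host metrics.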
I have tried restarting the Ganglia services with no luck, and have restarted all services, but it's still the same. Does anyone have any ideas how to get the dashboard working properly? Thank you.

It turned out to be a proxy issue. To access the internet I had to add my proxy details to the file /var/lib/ambari-server/ambari-env.sh:
export AMBARI_JVM_ARGS=$AMBARI_JVM_ARGS' -Xms512m -Xmx2048m -Dhttp.proxyHost=theproxy -Dhttp.proxyPort=80 -Djava.security.auth.login.config=/etc/ambari-server/conf/krb5JAASLogin.conf -Djava.security.krb5.conf=/etc/krb5.conf -Djavax.security.auth.useSubjectCredsOnly=false'
When Ambari was trying to reach the Ganglia service on each node in the cluster, the request was going via the proxy and never resolving. To overcome the issue I added my nodes to the exclude list (via the -Dhttp.nonProxyHosts flag) like so:
export AMBARI_JVM_ARGS=$AMBARI_JVM_ARGS' -Xms512m -Xmx2048m -Dhttp.proxyHost=theproxy -Dhttp.proxyPort=80 -Dhttp.nonProxyHosts="localhost|node1.dms|node2.dms|node3.dms|etc" -Djava.security.auth.login.config=/etc/ambari-server/conf/krb5JAASLogin.conf -Djava.security.krb5.conf=/etc/krb5.conf -Djavax.security.auth.useSubjectCredsOnly=false'
After adding the exclude list the stats were retrieved as expected!
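Note that changes to AMBARI_JVM_ARGS only take effect after restarting the Ambari server:
ambari-server restart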

Related

Dronekit-sitl fails to bind on default port 5760

I have dronekit-sitl installed in a Python 3 virtual environment on my Windows 10 machine and have used it before by running dronekit-sitl copter with no issues. However, as of today I am running into what seems to be a permission issue when trying to execute the ArduCopter SITL.
$ dronekit-sitl copter
os: win, apm: copter, release: stable
SITL already Downloaded and Extracted.
Ready to boot.
Execute: C:\Users\kyrlon\.dronekit\sitl\copter-3.3\apm.exe --home=-35.363261,149.165230,584,353 --model=quad -I 0
SITL-0> Started model quad at -35.363261,149.165230,584,353 at speed 1.0
SITL-0.stderr> bind port 5760 for 0
Starting sketch 'ArduCopter'
bind failed on port 5760 - Operation not permitted
Starting SITL input
I am not sure what might have triggered this new permission issue. I tried starting over with a fresh Python environment, but even after a complete PC shutdown I still get the error shown above.
It turns out that having Docker on my system was the culprit: it excluded the port range I was attempting to use, as mentioned in this SO post, which led me to this GitHub issue. Running the command in an elevated terminal:
netsh interface ipv4 show excludedportrange protocol=tcp
provided the following list of excluded port ranges:
Protocol tcp Port Exclusion Ranges
Start Port End Port
---------- --------
1496 1595
1658 1757
1758 1857
1858 1957
1958 2057
2058 2157
2180 2279
2280 2379
2380 2479
2480 2579
2702 2801
2802 2901
2902 3001
3002 3101
3102 3201
3202 3301
3390 3489
3490 3589
3590 3689
3693 3792
3793 3892
3893 3992
3993 4092
4093 4192
4193 4292
4293 4392
4393 4492
4493 4592
4593 4692
4768 4867
4868 4967
5041 5140
5141 5240
5241 5340
5357 5357
5358 5457
5458 5557
5558 5657
5700 5700
5701 5800
8005 8005
8884 8884
15202 15301
15302 15401
15402 15501
15502 15601
15602 15701
15702 15801
15802 15901
15902 16001
16002 16101
16102 16201
16202 16301
16302 16401
16402 16501
16502 16601
16602 16701
16702 16801
16802 16901
16993 17092
17093 17192
50000 50059 *
* - Administered port exclusions.
It turns out that Docker, or possibly Hyper-V, excluded the range that includes 5760:
5701 5800
As mentioned in the GitHub issue, I had probably resolved this before because a number of restarts shifted the excluded port ranges, or I possibly got lucky in the past by starting dronekit-sitl before Docker ran on my system.
Either way, to resolve the Operation not permitted error, running the following commands as admin:
net stop winnat
net start winnat
solved the issue with dronekit-sitl without having to specify a different port besides the default 5760.
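A more permanent alternative, as a sketch based on the same GitHub issue (run in an elevated terminal), is to reserve 5760 explicitly so WinNAT can never grab it:
net stop winnat
netsh interface ipv4 add excludedportrange protocol=tcp startport=5760 numberofports=1
net start winnat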

Sonar-scanner hangs after 'Load active rules (done)' is shown in the logs

The tail of logging shows the following:
22:09:11.016 DEBUG: GET 200 http://someserversomewhere:9000/api/rules/search.protobuf?f=repo,name,severity,lang,internalKey,templateKey,params,actives,createdAt,updatedAt&activation=true&qprofile=AXaXXXXXXXXXXXXXXXw0&ps=500&p=1 | time=427ms
22:09:11.038 INFO: Load active rules (done) | time=12755ms
I have attached to the running container to see whether the scanner process is pegged/running/etc., and it shows the following:
Mem: 2960944K used, 106248K free, 67380K shrd, 5032K buff, 209352K cached
CPU: 0% usr 0% sys 0% nic 99% idle 0% io 0% irq 0% sirq
Load average: 5.01 5.03 4.83 1/752 46
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
1 0 root S 3811m 127% 1 0% /opt/java/openjdk/bin/java -Djava.awt.headless=true -classpath /opt/sonar-scanner/lib/sonar-scann
40 0 root S 2424 0% 0 0% bash
46 40 root R 1584 0% 0 0% top
I was unable to find any logging in the sonar-scanner-cli container to help indicate the state. It appears to just be hung and waiting for something to happen.
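One way to see what the JVM is actually waiting on, assuming it is PID 1 as in the top output above (the container name below is a placeholder), is to force a thread dump:
docker exec scanner-container kill -3 1    # SIGQUIT; the dump goes to the JVM's stdout
docker logs scanner-container | tail -n 200
If the image ships a full JDK, docker exec scanner-container jstack 1 works as well.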
I am running SonarQube locally from Docker at the LTS version 7.9.5.
I am also running the Docker container sonarsource/sonar-scanner-cli, which currently uses the following version in its Dockerfile:
SONAR_SCANNER_VERSION=4.5.0.2216
I am triggering the scan via the following command:
docker run --rm \
-e SONAR_HOST_URL="http://someserversomewhere:9000" \
-e SONAR_LOGIN="nottherealusername" \
-e SONAR_PASSWORD="not12345likeinspaceballs" \
-v "$DOCKER_TEST_DIRECTORY:/usr/src" \
--link "myDockerContainerNameForSonarQube" \
sonarsource/sonar-scanner-cli -X -Dsonar.password=not12345likeinspaceballs -Dsonar.verbose=true \
-Dsonar.sources=app -Dsonar.tests=test -Dsonar.branch=master \
-Dsonar.projectKey="${PROJECT_KEY}" -Dsonar.log.level=TRACE \
-Dsonar.projectBaseDir=/usr/src/$PROJECT_NAME -Dsonar.working.directory=/usr/src/$PROJECT_NAME/$SCANNER_WORK_DIR
I have done a lot of digging to find anyone with similar issues, and found the following older issue, which seems similar, but it is unclear how to determine whether I am experiencing something related: Why does sonar-maven-plugin hang at loading global settings or active rules?
I am stuck and not sure what to do next; any help or hints would be appreciated.
An additional note: this process does work for the 8.4.2-developer version of SonarQube that I am planning to migrate to. The purpose of verifying 7.9.5 is to follow SonarQube's recommended upgrade path, which is to first bring your current version to the latest LTS and run the data migration before jumping to the next major version.

Performance issue in Spring Boot REST API web service

In our organization we have started an integration through a REST API web service, but we have a strange performance problem.
Data:
We have a virtual machine (VMware) with 4 cores/8 GB RAM and sufficient remote storage.
Ubuntu Server 18.04
OpenJDK 11.0.7 (2020-04-14)
JAVA_OPTS='-Djava.awt.headless=true -Xms512m -Xmx2048m -XX:MaxPermSize=256m'
MySQL 5.7.30-0ubuntu0.18.04.1 (it runs locally, but the app connects by host name).
App: Spring Boot 2.1.3 (Tomcat, Spring Data JPA, Hikari, Hibernate), all parameters at their defaults.
top - 15:09:15 up 2 days, 14:21, 1 user, load average: 0.03, 0.01, 0.00
Tasks: 189 total, 1 running, 100 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.2 sy, 0.0 ni, 99.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8168140 total, 148740 free, 7590936 used, 428464 buff/cache
KiB Swap: 2097148 total, 1352428 free, 744720 used. 332048 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2383 app 20 0 41920 3944 3220 R 0.7 0.0 0:00.53 top
2698 app 20 0 5835612 402424 15312 S 0.7 4.9 23:13.92 java
1786 mysql 20 0 2680528 321892 8108 S 0.3 3.9 20:38.32 mysqld
2677 app 20 0 5850152 441440 15824 S 0.3 5.4 28:01.41 java <------
2769 app 20 0 5868308 977.2m 16868 S 0.3 12.3 49:25.72 java
ps -eaf | grep java
app 2677 2676 0 Jul07 ? 00:28:01 java -Dserver.port=4560 -jar app-ws-1.0.0-SNAPSHOT.jar <------
app 2698 2696 0 Jul07 ? 00:23:14 java -Dserver.port=4561 -jar app-ws-1.0.0-SNAPSHOT.jar
app 2769 2768 1 Jul07 ? 00:49:26 java -jar app-gui-1.0.0-SNAPSHOT.jar
We have 2 web services, one functional (2677) and one in testing (2698), plus a web app (2769).
We have a problem with the first one. The first call takes >30 s, causing a timeout in the calling system, but subsequent calls are processed fine in <5 s.
The number of calls is minimal, 10 max per day and never concurrent. The timeout can also recur if several hours pass without calls (>5 h).
We have checked the code, and we have checked VMware/Ubuntu (suspension options), and we haven't seen anything in the monitoring.
We have been told that it could be a JVM or GC problem, but I personally don't understand much about that and I haven't seen anything with the Memory Analyzer.
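One low-risk way to test the JVM/GC theory, as a sketch assuming JDK 11's unified logging (the log path is just an example), is to enable GC logging and compare the timestamps around a slow first call:
JAVA_OPTS='-Djava.awt.headless=true -Xms512m -Xmx2048m -Xlog:gc*:file=/tmp/app-gc.log:time,uptime'
Note that -XX:MaxPermSize is ignored on JDK 11 (PermGen was removed in Java 8), so it can be dropped from the options.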
Later on we implemented in the app itself a dummy call (localhost) every 10 minutes to "warm up the machine", but even so the first call still takes >30 s while the rest do not. The dummy call just answers OK.
We don't know what the cause could be, and we don't know how to rule out options, since this is a production environment and it doesn't allow many changes.

cluster-mode SPARK refuses to run more than two jobs concurrently

My Spark cluster refuses to run more than two jobs simultaneously. One of the three will invariably stay stuck in the ACCEPTED state.
Hardware
4 data nodes with Spark clients, 24 GB RAM, 4 processors each
The cluster metrics show there should be enough cores:
Apps Submitted 3
Apps Pending 1
Apps Running 2
Apps Completed 0
Containers Running 4
Memory Used 8GB
Memory Total 32GB
Memory Reserved 0B
VCores Used 4
VCores Total 8
VCores Reserved 0
Active Nodes 2
Decommissioned Nodes 0
Lost Nodes 0
Unhealthy Nodes 0
Rebooted Nodes 0
On the Application Manager you can see that the only way to run the third app is to kill a running one:
application_1504018580976_0002 adm com.x.app1 SPARK default 0 [date] N/A RUNNING UNDEFINED 2 2 5120 25.0 25.0
application_1500031233020_0090 adm com.x.app2 SPARK default 0 [date] N/A RUNNING UNDEFINED 2 2 3072 25.0 25.0
application_1504024737012_0001 adm com.x.app3 SPARK default 0 [date] N/A ACCEPTED UNDEFINED 0 0 0 0.0 0.0
The running apps each have 2 containers and 2 allocated vcores, 25% of the queue and 25% of the cluster.
Deployment command for all 3 apps:
/usr/hdp/current/spark2-client/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --driver-cores 1 \
    --driver-memory 512m \
    --num-executors 1 \
    --executor-cores 1 \
    --executor-memory 1G \
    --class com..x.appx ../lib/foo.jar
Capacity Scheduler
yarn.scheduler.capacity.default.minimum-user-limit-percent = 100
yarn.scheduler.capacity.maximum-am-resource-percent = 0.2
yarn.scheduler.capacity.maximum-applications = 10000
yarn.scheduler.capacity.node-locality-delay = 40
yarn.scheduler.capacity.root.accessible-node-labels = *
yarn.scheduler.capacity.root.acl_administer_queue = *
yarn.scheduler.capacity.root.capacity = 100
yarn.scheduler.capacity.root.default.acl_administer_jobs = *
yarn.scheduler.capacity.root.default.acl_submit_applications = *
yarn.scheduler.capacity.root.default.capacity = 100
yarn.scheduler.capacity.root.default.maximum-capacity = 100
yarn.scheduler.capacity.root.default.state = RUNNING
yarn.scheduler.capacity.root.default.user-limit-factor = 1
yarn.scheduler.capacity.root.queues = default
Your setting:
yarn.scheduler.capacity.maximum-am-resource-percent = 0.2
Implies:
total vcores (8) x maximum-am-resource-percent (0.2) = 1.6
1.6 gets rounded up to 2, since partial vcores make no sense. This means you can have only 2 ApplicationMasters at a time, which is why you can only run 2 jobs at a time.
Solution: bump yarn.scheduler.capacity.maximum-am-resource-percent up to a higher value, such as 0.5.
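For example, set the property in capacity-scheduler.xml (or the equivalent field in Ambari):
yarn.scheduler.capacity.maximum-am-resource-percent = 0.5
then refresh the queues so the change takes effect without restarting the ResourceManager:
yarn rmadmin -refreshQueues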
The following parameters control parallel execution:
spark.executor.instances -> number of executors
spark.executor.cores -> number of cores per executor
spark.task.cpus -> number of CPUs allocated per task
https://spark.apache.org/docs/latest/submitting-applications.html
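For illustration, the same knobs can be passed at submit time via --conf (the values, class, and jar below are arbitrary):
spark-submit \
    --master yarn \
    --conf spark.executor.instances=2 \
    --conf spark.executor.cores=2 \
    --conf spark.task.cpus=1 \
    --class com.example.App app.jar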

RethinkDB: why does rethinkdb service use so much memory?

After encountering situations where I found the rethinkdb service down for an unknown reason, I noticed it uses a lot of memory:
# free -m
total used free shared buffers cached
Mem: 7872 7744 128 0 30 68
-/+ buffers/cache: 7645 226
Swap: 4031 287 3744
# top
top - 23:12:51 up 7 days, 1:16, 3 users, load average: 0.00, 0.00, 0.00
Tasks: 133 total, 1 running, 132 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.2%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8061372k total, 7931724k used, 129648k free, 32752k buffers
Swap: 4128760k total, 294732k used, 3834028k free, 71260k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1835 root 20 0 7830m 7.2g 5480 S 1.0 94.1 292:43.38 rethinkdb
29417 root 20 0 15036 1256 944 R 0.3 0.0 0:00.05 top
1 root 20 0 19364 1016 872 S 0.0 0.0 0:00.87 init
# cat log_file | tail -9
2014-09-22T21:56:47.448701122 0.052935s info: Running rethinkdb 1.12.5 (GCC 4.4.7)...
2014-09-22T21:56:47.452809839 0.057044s info: Running on Linux 2.6.32-431.17.1.el6.x86_64 x86_64
2014-09-22T21:56:47.452969820 0.057204s info: Using cache size of 3327 MB
2014-09-22T21:56:47.453169285 0.057404s info: Loading data from directory /rethinkdb_data
2014-09-22T21:56:47.571843375 0.176078s info: Listening for intracluster connections on port 29015
2014-09-22T21:56:47.587691636 0.191926s info: Listening for client driver connections on port 28015
2014-09-22T21:56:47.587912507 0.192147s info: Listening for administrative HTTP connections on port 8080
2014-09-22T21:56:47.595163724 0.199398s info: Listening on addresses
2014-09-22T21:56:47.595167377 0.199401s info: Server ready
That seems like a lot considering the size of the data files:
# du -h
4.0K ./tmp
156M .
Do I need to configure a different cache size? Do you think it has something to do with the service surprisingly going down? I'm using v1.12.5.
There were a few memory leaks in previous versions, the main one being https://github.com/rethinkdb/rethinkdb/issues/2840
You should probably update RethinkDB -- the current version is 1.15.
If you run 1.12, you will need to export your data, but that should be the last time you need to do so, since 1.14 introduced seamless migrations.
From Understanding RethinkDB memory requirements - RethinkDB:
By default, RethinkDB automatically configures the cache size limit according to the formula (available_mem - 1024 MB) / 2.
You can change this via a config file as they document, or change it with a size (in MB) from the command line:
rethinkdb --cache-size 2048
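Or, as a sketch of the config-file route (the instance config path varies by install; this one is typical for packaged installs):
# /etc/rethinkdb/instances.d/default.conf
cache-size=2048
Then restart the service, e.g. sudo service rethinkdb restart.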
