Torque/PBS cannot find munge.socket.2

I am trying to create a virtual cluster for my MPI classes so I can work at home instead of spending the whole day in the university labs. For two days now I have not been able to figure out how to fix this problem with munge.
This is the output I get:
[root@localhost lumx]# qmgr -c "set server acl_hosts = mars"
munge_encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (6)
Unable to communicate with localhost(127.0.0.1)
Communication failure.
qmgr: cannot connect to server (errno=15009) munge executable not found, unable to authenticate
My hosts file looks like this
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1 mpimaster localhost.localdomain localhost
I have tried to read as much as I could and ended up with these guides: Getting started with Open MPI on Fedora; Installing Torque/PBS job scheduler on Ubuntu 14.04 LTS; TORQUE Arch Linux; http://juanjose.garciaripoll.com/fedora-cluster/5-torque-pbs-queue

I solved it. I had to force munged to start and send its log messages to syslog, because there are still some permission problems I have not fixed.
The command I use is:
munged --force --syslog
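The qmgr error says the socket file itself is missing, so a quick first check is whether munged has actually created it. A minimal Python sketch, assuming the default socket path from the error message above (/var/run/munge/munge.socket.2); the function name munge_socket_status is hypothetical, and a munge package built with a different state directory may put the socket elsewhere:

```python
import os
import stat

# Path taken from the qmgr error above (an assumption; adjust for your build).
DEFAULT_MUNGE_SOCKET = "/var/run/munge/munge.socket.2"


def munge_socket_status(path: str = DEFAULT_MUNGE_SOCKET) -> str:
    """Report whether `path` exists and is a Unix domain socket."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        # Same situation as "No such file or directory": munged is not
        # running, or it created its socket somewhere else.
        return "missing"
    except PermissionError:
        # A parent directory (e.g. /var/run/munge) is not accessible to this user.
        return "permission denied"
    if stat.S_ISSOCK(mode):
        return "socket"
    return "not a socket"


if __name__ == "__main__":
    print(munge_socket_status())
```

Run it as the same user that runs qmgr: "missing" corresponds to the error above, while "permission denied" may point to the kind of permission problem mentioned in the answer.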

On openSUSE Leap 15.x, start the munge service and check its status with:
sudo service munge start
sudo service munge status
If the service is shown as active, you can verify it by running:
munge -n
That should print a credential similar to this one:
MUNGE:AwQFAAAHiPEv+E6Ezy2HVHUwo5PZ2fkbbr4yP7pZZA9Yo6BWQdAFGVRNkhNbRkvd9zNAvnpg0iQzkjg+WW6HdIix48nKrA0QnjispII4RoT1UqZLh7ybIl5/WIvd3ta85v1KV8A=:
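The same verification can be scripted, for example as part of a setup script for the virtual cluster. A hedged Python sketch, assuming the munge and unmunge binaries are on PATH: it runs munge -n, applies a rough shape check based only on the sample output above (a MUNGE: prefix and a trailing colon), and then pipes the credential into unmunge, which decodes it and exits non-zero on failure. The helper names are hypothetical.

```python
import shutil
import subprocess


def looks_like_munge_credential(text: str) -> bool:
    """Heuristic shape check based on the sample output: 'MUNGE:<payload>:'."""
    text = text.strip()
    return text.startswith("MUNGE:") and text.endswith(":") and len(text) > len("MUNGE::")


def munge_round_trip() -> bool:
    """Encode an empty credential with `munge -n` and decode it with `unmunge`.

    Returns False if either binary is missing or either step fails.
    """
    if shutil.which("munge") is None or shutil.which("unmunge") is None:
        return False
    encoded = subprocess.run(["munge", "-n"], capture_output=True, text=True)
    if encoded.returncode != 0 or not looks_like_munge_credential(encoded.stdout):
        return False
    # unmunge reads the credential from stdin; a zero exit code means it decoded.
    decoded = subprocess.run(["unmunge"], input=encoded.stdout,
                             capture_output=True, text=True)
    return decoded.returncode == 0
```

The shape check alone only catches obviously wrong output such as an error message; the unmunge step is what confirms the daemon can actually decode its own credentials.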

Related

Hadoop setup issue: "ssh: Could not resolve hostname now.: No address associated with hostname"

When I build a Hadoop cluster on VMware and run the sbin/start-dfs.sh command, I get an SSH error:
ssh: Could not resolve hostname now.: No address associated with hostname
I checked the hostnames and IP addresses in /etc/hosts (with vi /etc/hosts) and also checked /etc/profile, and I am sure there is no mistake in them.
A few suggestions:
Check that the hostnames in hdfs-site.xml are set correctly. If you are running a single-host setup and set the NameNode host to localhost, make sure localhost is mapped to 127.0.0.1 in your /etc/hosts. If you are setting up multiple nodes, use the FQDN of each host in your configuration, and make sure each FQDN is mapped to the correct IP address in /etc/hosts.
Set up passwordless SSH. start-dfs.sh requires passwordless SSH from the host where you run it to the rest of the cluster nodes. Verify this by running ssh hostx date and checking that it does not ask for a password.
Check the hostname in the error message (maybe you did not paste the complete log). For the problematic hostname, run the SSH command manually to make sure the name can be resolved. If it cannot, check /etc/hosts. A common /etc/hosts setup looks like this:
127.0.0.1 localhost localhost.localdomain
::1 localhost localhost.localdomain
172.16.151.224 host1.test.com host1
172.16.152.238 host2.test.com host2
172.16.153.108 host3.test.com host3
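To check the last suggestion systematically, you can parse /etc/hosts and list which of your cluster hostnames it does not map. A minimal Python sketch with hypothetical helper names; the hostnames in the test come from the sample file above. It only reads the hosts file, so names that are served by DNS will be reported as unmapped even if they resolve.

```python
from typing import Dict, Iterable, List


def parse_hosts(text: str) -> Dict[str, List[str]]:
    """Map each hostname/alias in hosts-file text to the addresses listed for it."""
    mapping: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Hostname lookups are case-insensitive, so store names in lower case.
            mapping.setdefault(name.lower(), []).append(address)
    return mapping


def unmapped(hostnames: Iterable[str], hosts_text: str) -> List[str]:
    """Return the hostnames that have no entry in the hosts-file text."""
    mapping = parse_hosts(hosts_text)
    return [h for h in hostnames if h.lower() not in mapping]
```

For example, passing the names from your Hadoop configuration together with the contents of /etc/hosts returns the names that still need an entry (or a working DNS record).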

gethostbyname fail after switching internet connections

I often (but not always) get the following error when running MPI jobs after switching Wi-Fi networks.
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(498)..............:
MPID_Init(187).....................: channel initialization failed
MPIDI_CH3_Init(89).................:
MPID_nem_init(320).................:
MPID_nem_tcp_init(171).............:
MPID_nem_tcp_get_business_card(418):
MPID_nem_tcp_init(377).............: gethostbyname failed, MacBook-Pro.local (errno 1)
Everything works fine in the coffee shop, and then when I come home, I get the above error. Nothing else has changed.
I've checked the /etc/hosts and /private/etc/hosts files, and they look okay:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
255.255.255.255 broadcasthost
I can ping localhost, so the problem isn't exactly that localhost isn't resolved.
Rebooting always fixes the problem, but is there something simple I can do to "reset" my system so that it recognizes localhost?
I don't have access to the details of the MPI initialization routines in the code I am running, and I am not making any explicit calls to gethostname.
I am using MPICH 3.1.4 (built Feb 2015) and am running OS X 10.10.3.
The answer is very simple; here is what seems to work.
I edited the file /etc/hosts (/private/etc/hosts on OS X) and added the line
127.0.0.1 macbook-pro.local
so now my hosts file looks like this:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
255.255.255.255 broadcasthost
127.0.0.1 macbook-pro.local
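The underlying problem in this question is that the machine's own hostname (MacBook-Pro.local in the error stack) stops resolving after the network change. A minimal Python diagnostic, a sketch that is not part of MPICH: it looks up socket.gethostname() with socket.gethostbyname (an IPv4 lookup) and reports whether the name resolves. The function names are hypothetical.

```python
import socket
from typing import Optional


def resolve_ipv4(name: str) -> Optional[str]:
    """Return an IPv4 address for `name`, or None if the resolver cannot find it."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror and socket.herror are subclasses of OSError
        return None


def check_own_hostname() -> str:
    """Describe whether this machine's hostname currently resolves."""
    name = socket.gethostname()
    address = resolve_ipv4(name)
    if address is None:
        return f"{name} does not resolve; consider adding it to /etc/hosts"
    return f"{name} resolves to {address}"


if __name__ == "__main__":
    print(check_own_hostname())
```

Running it before launching an MPI job, on the home network and in the coffee shop, shows whether the failure lines up with the hostname lookup.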
We were facing this issue intermittently on our CI server. It seems that setting the environment variable MPICH_INTERFACE_HOSTNAME to localhost helped.

Correct configuration of /etc/hosts file for Oracle Forms and Reports on Virtualbox

I am trying to install Oracle Forms and Reports on WebLogic 10.3.6, using VirtualBox and Oracle Linux 6.5.
Forms and Reports keeps failing on 'Executing: opmnctl startproc ias-component..'
After looking through the available logs and searching the net, I think the problem lies with my /etc/hosts file, but I do not understand why or how to fix it.
My /etc/hosts file contains:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost.localdomain6
I also tried changing this file to:
127.0.0.1 localhost localhost4
::1 localhost localhost6
Does anyone know what I should have in this file?
Usually you leave the hosts file at its defaults and use DNS for name resolution. I would verify the DNS settings for your server, then use ping to test name resolution.

Not able to run commands in the HBase shell

I keep getting INFO ipc.HBaseRPC: Problem connecting to server: /129.10.193.117:49176 when I try to create or disable tables in HBase. I have googled this error and found a couple of answers. All of them say something like "Default installation added a line in /etc/hosts which linked to machine hostname with the IP 127.0.1.1."
Here is my /etc/hosts file:
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
fe80::1%lo0 localhost
Also, my conf/regionservers points to localhost. Does anyone know how to fix this? After shutting down HBase and Hadoop and then restarting my system, I no longer get this error, and I am not sure why.
Are your Hadoop configurations also pointing to localhost?

Installing Glassfish 3 on Mac OS X 10.8.2 fails in Domain Info setup

I am a Glassfish newbie, though a Java and Unix veteran. I am running the shell script to install Java EE 6 SDK, including Glassfish 3 on my Mac.
bash-3.2# sh java_ee_sdk-6u4-unix.sh
Everything goes along fine until I get to the Domain Info screen. I am sticking with the default info (except for passwords):
Domain Name: domain1
Admin Port: 4848 <- I have verified with netstat that both ports are free
Http Port: 8080 <-
Username: admin
Password: xxxxxxxxxxxx
Service Name: domain1Service
+ Start domain after creation
When I click Next, I get 2 error dialogs that tell me the following:
Admin Port: Host name not found
Http Port: Host name not found
Does anyone know how to get past this?
I can confirm this. I fixed the same problem on Linux by changing my hostname to localhost with the command hostname localhost.
Looks like something is wrong with your network config. Maybe your hosts file is missing the entry for localhost.
Check that the file /private/etc/hosts contains
127.0.0.1 localhost
You can also try to set your hostname with
sudo hostname localhost
The answer was relatively simple. I had to add the hostname configured in the Mac OS X settings to my /etc/hosts file as an alias for localhost. I'm not sure where Mac OS X keeps the hostname, but the hostname command (i.e. gethostname) was returning 'airguitar', which is the name Glassfish was trying to connect to. Since that name wasn't in /etc/hosts, it couldn't be resolved.
This problem happens on many systems; it happens on my Linux machine as well. The solution is quite simple, and chuck almost got it.
Check your hosts file; on Linux it is /etc/hosts. You'll probably have a file like this:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
Check your hostname with the command hostname:
[root@glassfish1 opt]# hostname
glassfish1
Then add this hostname to your hosts file like this:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 glassfish1
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
This did the trick for me.
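The edit above, appending the output of hostname to the existing 127.0.0.1 line, can also be done programmatically. A minimal Python sketch with a hypothetical helper add_loopback_alias: it works on the file's text and returns the new contents instead of writing /etc/hosts, so the result can be reviewed before it is copied into place as root. It assumes the file already has a 127.0.0.1 line, as the default files shown above do, and it leaves the text unchanged if the name is already listed.

```python
def add_loopback_alias(hosts_text: str, hostname: str) -> str:
    """Append `hostname` to the first 127.0.0.1 line unless some line already maps it.

    Returns the (possibly unchanged) file contents; nothing is written to disk.
    """
    lines = hosts_text.splitlines()
    # Leave the file alone if the name already appears as a hostname or alias.
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if hostname.lower() in (f.lower() for f in fields[1:]):
            return hosts_text
    for i, line in enumerate(lines):
        body, sep, comment = line.partition("#")
        fields = body.split()
        if fields and fields[0] == "127.0.0.1":
            # Insert the alias before any trailing comment on that line.
            lines[i] = body.rstrip() + " " + hostname + (" " + sep + comment if sep else "")
            break
    else:
        raise ValueError("no 127.0.0.1 line found")
    return "\n".join(lines) + ("\n" if hosts_text.endswith("\n") else "")
```

On a machine whose hostname is glassfish1 and whose hosts file is the default shown above, passing the file's contents and socket.gethostname() yields the modified 127.0.0.1 line from this answer.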
