Running OpenMPI on Windows XP - windows

I'm trying to build a simple cluster based on Windows XP. I compiled OpenMPI-1.4.2 successfully, and tools like mpicc and ompi_info work too, but I can't get my mpirun working properly. The only output I can see is
Z:\>orterun --hostfile z:\hosts.txt -np 2 hostname
[host0:04728] Failed to initialize COM library. Error code = -2147417850
[host0:04728] [[8946,0],0] ORTE_ERROR_LOG: Error in file ..\..\openmpi-1.4.2
\orte\mca\ess\hnp\ess_hnp_module.c at line 218
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_plm_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[host0:04728] [[8946,0],0] ORTE_ERROR_LOG: Error in file ..\..\openmpi-1.4.2
\orte\runtime\orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_set_name failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[host0:04728] [[8946,0],0] ORTE_ERROR_LOG: Error in file ..\..\..\..\openmpi
-1.4.2\orte\tools\orterun\orterun.c at line 543
Where z:\hosts.txt appears as follows:
host0
host1
Z: is a shared network drive available to both host0 and host1.
What my problem is and how do I fix it?
Upd:
Ok, this problem seems to be fixed. It seems to me that WideCap driver and/or software components causes this error to appear. A "clean" machine runs local task successfully. Anyway, I still cannot run a task within at least 2 machines, I'm getting following message:
Z:\>mpirun --hostfile z:\hosts.txt -np 2 hostname
connecting to host1
username:MAIN\cluster
password:********
Save Credential?(Y/N) y
[host0:04728] This feature hasn't been implemented yet.
[host0:04728] Could not connect to namespace cimv2 on node host1. Error code =-2147217400
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an error.
More information may be available above.
--------------------------------------------------------------------------
I googled a little and did all the things as described here: http://www.open-mpi.org/community/lists/users/2010/03/12355.php but I'm still getting the same error. Can anyone help me?
Upd2:
Error code -2147217400 might be WMI error WBEM_E_INVALID_PARAMETER (0x80041008) which occures when one of the parameters passed to the WMI call is not correct. Does this mean that the problem is in OpenMPI source code itself? Or maybe it's because of wrong/outdated wincred.h and credui.lib I used while building OpenMPI from the source code?

Related

Error: command "bash" failed with no error message?

I am using terraform on my Mac system, and terraform apply results with below error
Error: command "bash" failed with no error message
on ssm.tf line 7, in data "external" "ssm-dynamic-general":
7: data "external" "ssm-dynamic-general" {
However there is nothing wrong in ssm.tf file, same runs perfectly fine in my another system.
Can some one please let me know what i am missing here?
You might have done what I accidentally did: not follow the external program protocol:
https://www.terraform.io/docs/providers/external/data_source.html#external-program-protocol
In my particular case, I failed to send the errors that were coming from my program to standard error. Instead, those errors were going to standard out.
That's why Terraform wasn't able to report on those errors.
So if you send any and all errors from your program to standard error using > &2, you should be able to see those errors when you run terraform plan.

TFS2015 Test Agent Aborted - PowerShell script completed with errors

I am running a TFS nightly build that for the last few days has not been able to complete all its tests. It fails after several hours with a "Test run is aborted" message. Previous to this the tests always ran successfully, and no major changes(or even minor) have been made to the system that runs these tests.
Information:
Two MStest runs in the build(unit tests)
Timeout is set to 20 hours
Runs for approx. 15 hours before failure
Tests are set to continue on failure
When I look in the TFS log for the latest run it lists the following(2017-04-11T06:42:47.5500707Z):
[warning]DistributedTests: Test run is aborted. Logging details of the run logs.
[warning]DistributedTests: New test run created.
[warning]Test Run queued for Project Collection Build Service
[warning]DistributedTests: Test discovery started.
[warning]DistributedTests: Test Run Discovery Completed . Test run id: 533
[warning]DistributedTests: 290 test cases discovered.
[warning]DistributedTests: Test execution started. Test run id : 533
[warning]DistributedTests: Test run timed out. Test run id : 533
[warning]DistributedTests: Test run aborted. Test run id: 533
[error]The test run was aborted, failing the task.
When I look at the run log(worker_20170410-234426-utc_864.log) I see:
06:42:47.659516 BaseLogger.LogConsoleMessage(scope.JobId =
7ced7f31-e360-47f3-b334-ef20faeaf000, message = ##[error]The test run
was aborted, failing the task.) 06:42:47.659516
Microsoft.TeamFoundation.DistributedTask.Agent.Common.AgentExecutionTerminationException:
PowerShell script completed with errors. at
Microsoft.TeamFoundation.DistributedTask.Handlers.PowerShellHandler.Execute(ITaskContext
context, CancellationToken cancellationToken) at
Microsoft.TeamFoundation.DistributedTask.Worker.JobRunner.RunTask(ITaskContext
context, TaskWrapper task, CancellationTokenSource tokenSource)
In the test log, I don't see any errors in the VS, just a warning about not able to connect(I see these often):
W, 2060, 5, 2017/04/10, 16:26:03.595, XXXTESTING\QTController.exe,
Test of LoadTestResultConnectString failed: A network-related or
instance-specific error occurred while establishing a connection to
SQL Server. The server was not found or was not accessible. Verify
that the instance name is correct and that SQL Server is configured to
allow remote connections. (provider: SQL Network Interfaces, error: 26
- Error Locating Server/Instance Specified)
I also see an error thrown in the Application Event log at the same time:
The description for Event ID 0 from source Application cannot be
found. Either the component that raises this event is not installed on
your local computer or the installation is corrupted. You can install
or repair the component on the local computer.
If the event originated on another computer, the display information
had to be saved with the event.
The following information was included with the event:
Error Handler Exception: System.ServiceModel.CommunicationException:
There was an error reading from the pipe: The pipe has been ended.
(109, 0x6d). ---> System.IO.IOException: The read operation failed,
see inner exception. ---> System.ServiceModel.CommunicationException:
There was an error reading from the pipe: The pipe has been ended.
(109, 0x6d). ---> System.IO.PipeException: There was an error reading
from the pipe: The pipe has been ended. (109, 0x6d).....
the message resource is present but the message is not found in the
string/message table
The issue is that I really don't know how to interpret these messages, each log just says "test run was aborted, failing the task", I'm not even certain the powershell issue is what caused it. I'm also not sure that the error thrown in the application log is related, though it was thrown at exactly the same time that the run failed.
It's also difficult to research this issue, when you really don't know what's causing the test agent to fail. There are posts related to VS, and to the TFS Test Agent, but these don't strike me as related issues, and of course there is this somewhat unhelpful post about the Powershell message.
Has anyone seen this sort of issue before? I don't think anything on my build server has changed over the last few days(maybe updates...), what do you think would cause an issue like this to occur?
If you look at the failed build(containing tests) after it is aborted in the "Build" section of TFS, its says it was "Aborted", that's it... If you look at results of the build(containing tests) in the "Test" section of TFS it specified that the test run "Exceeded Timeout".
Apparently MSTest was running up against the default value of this little gem. I think it defaults to 8 hours when not specified, but I'm not too sure about this. Anyways I set the following setting in my "Default.testsettings" file:
<?xml version="1.0" encoding="utf-8"?>
<TestSettings name="TestSettings1">
<Execution>
<Timeouts runTimeout="200000000" />
</Execution>
</TestSettings>
Seems to resolve the issue. Tests runs successfully and no longer time out.

How to determine location in code where segfault occurred using syslog message?

I have some code that can't be run under GDB because it's an embedded system. However, in the syslog I will occasionally see the following:
kernel: nam[13986]: segfault at b579000 ip b71737dc sp b5120c9c error 4 in libc-2.5.so[b7102000+13f000]
Is there some way to find where in the code this error occurred using the numbers listed in the error output above?
Yes, ip stands for "instruction pointer", and it's the location of the crash. In the quoted message, it's 0xb71737dc.

How to debug Errno::EIO error in Chef recipe using Chef::Provider::Git

I'm trying to use chef to check out a git repo to a windows client node.
This seems simple enough and I've got the following resource definition:
git "C:\\pathtocheckout" do
repo "https://gitserver/repo.git"
action [ :checkout, :sync]
end
But when this is reached by chef-client I get:
Errno::EIO: git[C:\pathtocheckout] (cookbook_name::test line 21) had an error: Errno::EIO: Input/output error - CreateProcessW
I've had a look at the stacktrace produced and it appears to be something to do with creating a process to run the git command - but this is the limit of my knowledge.
I've made sure git is installed on on Path, removed all other recipes from the run list, running as a different admin user and I've tried different repositories but all with the same error.
So I'm pretty stumped - anyone got a way I can dig into this error and see what is going on?

OS X NSTask error 22

I am building a simple package on OS X using pkgbuild that consists of a folder of stuff and the pre/postinstall scripts. When I try to execute the resulting package, the installer fails with the following message in the log:
Nov 1 13:28:11 localhost installd[631]: ./preinstall: 2013-11-01 13:28:11.074 installd[637:203] * NSTask: Task create for path '/tmp/PKInstallSandbox.P6mPx2/Scripts/com.xyz.utility.pkg.TWwYct/preinstall' failed: 22, "Invalid argument". Terminating temporary process.
The installer is running as root. The problem does not appear to be the contents of the scripts since they fail even after I cut them down to a simple "exit 0" with the interpreter declaration. This issues occurs on 10.8.2 and 10.8.4.
The issue looks like an exception that is thrown from within an NSTask object, but all I get is this 22 error code and the "Invalid argument" message. I think this message might refer to an NSInvalidArgumentException. I made sure the scripts are indeed being placed at the temporary location listed in the error message and they are there with the correct permissions.
Any ideas as to what causes this type of error message? I found several references to this error when I did some searching but there didn't seem to be a unifying cause or solution.
Add #!/bin/sh header into the preinstall script.

Resources