When I run a simulation for many time-steps in a particular configuration, I get the following error output:
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[exp-4-52:889691] 127 more processes have sent help message help-mpi-api.txt / mpi-abort
[exp-4-52:889691] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[warn] Epoll MOD(1) on fd 36 failed. Old events were 6; read change was 0 (none); write change was 2 (del): Bad file descriptor
[warn] Epoll MOD(4) on fd 36 failed. Old events were 6; read change was 2 (del); write change was 0 (none): Bad file descriptor
I suspect that my program is throwing an exception (it’s written in C++) and that MPI becomes unhappy when I return -1 from my main function. The actual code looks like this:
    int main(int ac, char* av[])
    {
        try
        {
            // entire simulation code here
            return 0;
        }
        catch (std::exception &exc)
        {
            std::cout << exc.what() << std::endl;
            return -1;
        }
        catch (...)
        {
            std::cout << "Got exception which wasn't caught" << std::endl;
            return -1;
        }
    }
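One way to check whether the catch blocks in the main function above actually run is to tag each message with the process rank and write it to std::cerr. The following is only a minimal sketch under stated assumptions: run_guarded and rank_label are hypothetical helper names (not part of MPI or of the original code), and the rank is read from the OMPI_COMM_WORLD_RANK environment variable, which Open MPI's launcher sets for each process. It makes no MPI calls, so it says nothing about whether MPI_ABORT itself is invoked.

```cpp
#include <cstdlib>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the rank as text, taken from the
// OMPI_COMM_WORLD_RANK environment variable (assumed to be set by
// Open MPI's launcher). Falls back to "?" when run outside mpirun.
std::string rank_label()
{
    const char* rank = std::getenv("OMPI_COMM_WORLD_RANK");
    return rank ? std::string(rank) : std::string("?");
}

// Hypothetical helper: runs `body` and maps any exception to a nonzero
// exit code, writing a rank-tagged diagnostic to std::cerr.
int run_guarded(const std::function<void()>& body)
{
    try
    {
        body();  // the entire simulation would go here
        return EXIT_SUCCESS;
    }
    catch (const std::exception& exc)
    {
        std::cerr << "[rank " << rank_label() << "] exception: "
                  << exc.what() << std::endl;
        return EXIT_FAILURE;
    }
    catch (...)
    {
        std::cerr << "[rank " << rank_label() << "] unknown exception"
                  << std::endl;
        return EXIT_FAILURE;
    }
}
```

With this in place, main could simply return run_guarded(...) with the simulation passed in as a lambda. If the rank-tagged message never shows up for the rank named in the abort banner, that would suggest the abort is coming from somewhere other than these catch blocks.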
I have the following questions:
- Would MPI_ABORT necessarily be called if an exception was thrown in the above try block?
- Is it possible that the catch block is not being executed (and therefore not giving me any more details about the error)?
- In the lines referencing help-mpi-api.txt and orte_base_help_aggregate, are they telling me that there are more error/help messages available somewhere else? If so, where do I find help-mpi-api.txt and the orte_base_help_aggregate parameter?
- Are these warnings obviously relevant as a cause of my error, or is it possible that they are a result of erroring out of my program too early?