Getting MPI error without any additional context from my program


When I run a simulation for many time steps in a particular configuration, I get the following error output:

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[exp-4-52:889691] 127 more processes have sent help message help-mpi-api.txt / mpi-abort
[exp-4-52:889691] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[warn] Epoll MOD(1) on fd 36 failed.  Old events were 6; read change was 0 (none); write change was 2 (del): Bad file descriptor
[warn] Epoll MOD(4) on fd 36 failed.  Old events were 6; read change was 2 (del); write change was 0 (none): Bad file descriptor

I suspect that my program, which is written in C++, is throwing an exception and that Open MPI reacts badly when I return -1 from main. The structure of main looks like this:

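#include <exception>
#include <iostream>

int main(int ac, char* av[])
{
    try
    {
        // entire simulation code here
        return 0;
    }
    catch (const std::exception &exc)
    {
        // Anything derived from std::exception: print its message.
        std::cout << exc.what() << std::endl;
        return -1;
    }
    catch (...)
    {
        // Any other thrown type ends up here.
        std::cout << "Got exception which wasn't caught" << std::endl;
        return -1;
    }
}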

I have the following questions:

  1. Would MPI_ABORT necessarily be called if an exception were thrown in the above try block?
  2. Is it possible that the catch block is not being executed (and therefore not giving me any more details about the error)?
  3. Do the lines referencing help-mpi-api.txt and orte_base_help_aggregate mean that more error/help messages are available somewhere else? If so, where do I find help-mpi-api.txt, and how do I set the orte_base_help_aggregate parameter?
  4. Are the Epoll warnings at the end of the output likely to be related to the cause of my error, or could they simply be a side effect of my program exiting early?
