Jittor multi-GPU training fails with an error

Describe the bug

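I tried multi-GPU training by following the Jittor documentation (jittor.mpi, Jittor 1.3.7.12 documentation).
The step that runs multi-GPU training on specific GPUs fails with an error: CUDA_VISIBLE_DEVICES="2,3" mpirun -np 2 python -m jittor.test.test_resnet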

Full Log

(jittor) ubuntu@yijie-163:~$ CUDA_VISIBLE_DEVICES="2,3" mpirun -np 2 python -m jittor.test.test_resnet
[i 0517 07:41:07.838178 40 compiler.py:955] Jittor(1.3.7.16) src: /home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor
[i 0517 07:41:07.840590 40 compiler.py:956] g++ at /usr/bin/g++(9.4.0)
[i 0517 07:41:07.840685 40 compiler.py:957] cache_path: /home/ubuntu/.cache/jittor/jt1.3.7/g++9.4.0/py3.10.11/Linux-5.15.0-5xbb/IntelRXeonRCPUxd5/default
[i 0517 07:41:07.859453 40 install_cuda.py:93] cuda_driver_version: [11, 6]
[i 0517 07:41:07.862853 40 __init__.py:411] Found /home/ubuntu/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc(11.2.152) at /home/ubuntu/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc.
[i 0517 07:41:07.865908 40 __init__.py:411] Found addr2line(2.34) at /usr/bin/addr2line.
[i 0517 07:41:08.376265 40 compiler.py:1010] cuda key:cu11.2.152_sm_70
[i 0517 07:41:08.759735 40 __init__.py:227] Total mem: 220.28GB, using 16 procs for compiling.
[i 0517 07:41:09.123535 40 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0517 07:41:09.936055 40 init.cc:62] Found cuda archs: [70,]
[i 0517 07:41:10.066881 40 __init__.py:411] Found mpicc(4.0.3) at /usr/bin/mpicc.
[i 0517 07:41:10.142917 92 compiler.py:955] Jittor(1.3.7.16) src: /home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor
[i 0517 07:41:10.145571 92 compiler.py:956] g++ at /usr/bin/g++(9.4.0)
[i 0517 07:41:10.145668 92 compiler.py:957] cache_path: /home/ubuntu/.cache/jittor/jt1.3.7/g++9.4.0/py3.10.11/Linux-5.15.0-5xbb/IntelRXeonRCPUxd5/default
[i 0517 07:41:10.165575 92 install_cuda.py:93] cuda_driver_version: [11, 6]
[i 0517 07:41:10.169115 92 __init__.py:411] Found /home/ubuntu/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc(11.2.152) at /home/ubuntu/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux/bin/nvcc.
[i 0517 07:41:10.172260 92 __init__.py:411] Found addr2line(2.34) at /usr/bin/addr2line.
[i 0517 07:41:10.621586 92 compiler.py:1010] cuda key:cu11.2.152_sm_70
[i 0517 07:41:10.949542 92 __init__.py:227] Total mem: 220.28GB, using 16 procs for compiling.
[i 0517 07:41:11.283292 92 jit_compiler.cc:28] Load cc_path: /usr/bin/g++
[i 0517 07:41:11.900170 92 init.cc:62] Found cuda archs: [70,]
[i 0517 07:41:12.036513 92 __init__.py:411] Found mpicc(4.0.3) at /usr/bin/mpicc.
[i 0517 07:41:12.390643 92 compile_extern.py:438] installing nccl...
/bin/sh: 1: make: not found
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor/__init__.py", line 25, in <module>
    from . import compile_extern
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor/compile_extern.py", line 587, in <module>
    setup_nccl()
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor/compile_extern.py", line 463, in setup_nccl
    nccl_home = install_nccl(nccl_path)
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor/compile_extern.py", line 443, in install_nccl
    run_cmd(f"CC=\"{cc_path}\" CXX=\"{cc_path}\" make -j8 src.build CUDA_HOME='{cuda_home}' NVCC_GENCODE='{arch_flag} --cudart=shared ' ", cwd=dirname)
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor_utils/__init__.py", line 188, in run_cmd
    raise Exception(err_msg)
Exception: Run cmd failed: CC="/usr/bin/g++" CXX="/usr/bin/g++" make -j8 src.build CUDA_HOME='/home/ubuntu/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux' NVCC_GENCODE=' -arch=compute_70  -code=sm_70  --cudart=shared '
[i 0517 07:41:12.464422 40 compile_extern.py:438] installing nccl...
/bin/sh: 1: make: not found
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor/__init__.py", line 25, in <module>
    from . import compile_extern
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor/compile_extern.py", line 587, in <module>
    setup_nccl()
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor/compile_extern.py", line 463, in setup_nccl
    nccl_home = install_nccl(nccl_path)
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor/compile_extern.py", line 443, in install_nccl
    run_cmd(f"CC=\"{cc_path}\" CXX=\"{cc_path}\" make -j8 src.build CUDA_HOME='{cuda_home}' NVCC_GENCODE='{arch_flag} --cudart=shared ' ", cwd=dirname)
  File "/home/ubuntu/anaconda3/envs/jittor/lib/python3.10/site-packages/jittor_utils/__init__.py", line 188, in run_cmd
    raise Exception(err_msg)
Exception: Run cmd failed: CC="/usr/bin/g++" CXX="/usr/bin/g++" make -j8 src.build CUDA_HOME='/home/ubuntu/.cache/jittor/jtcuda/cuda11.2_cudnn8_linux' NVCC_GENCODE=' -arch=compute_70  -code=sm_70  --cudart=shared '
[yijie-163:209560] *** Process received signal ***
[yijie-163:209560] Signal: Aborted (6)
[yijie-163:209560] Signal code:  (-6)
[yijie-163:209560] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fbb39074420]
[yijie-163:209560] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fbb38d5700b]
[yijie-163:209560] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fbb38d36859]
[yijie-163:209560] [ 3] /home/ubuntu/anaconda3/envs/jittor/bin/../lib/libstdc++.so.6(+0xb135a)[0x7fbb3794435a]
[yijie-163:209560] [ 4] /home/ubuntu/anaconda3/envs/jittor/bin/../lib/libstdc++.so.6(+0xb13c5)[0x7fbb379443c5]
[yijie-163:209560] [ 5] /home/ubuntu/.cache/jittor/jt1.3.7/g++9.4.0/py3.10.11/Linux-5.15.0-5xbb/IntelRXeonRCPUxd5/default/cu11.2.152_sm_70/jittor_core.cpython-310-x86_64-linux-gnu.so(+0x247f84)[0x7fbb35597f84]
[yijie-163:209560] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x468a7)[0x7fbb38d5a8a7]
[yijie-163:209560] [ 7] /lib/x86_64-linux-gnu/libc.so.6(on_exit+0x0)[0x7fbb38d5aa60]
[yijie-163:209560] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa)[0x7fbb38d3808a]
[yijie-163:209560] [ 9] python[0x5854ee]
[yijie-163:209560] *** End of error message ***
[yijie-163:209559] *** Process received signal ***
[yijie-163:209559] Signal: Aborted (6)
[yijie-163:209559] Signal code:  (-6)
[yijie-163:209559] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f7c026df420]
[yijie-163:209559] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f7c023c200b]
[yijie-163:209559] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f7c023a1859]
[yijie-163:209559] [ 3] /home/ubuntu/anaconda3/envs/jittor/bin/../lib/libstdc++.so.6(+0xb135a)[0x7f7c00faf35a]
[yijie-163:209559] [ 4] /home/ubuntu/anaconda3/envs/jittor/bin/../lib/libstdc++.so.6(+0xb13c5)[0x7f7c00faf3c5]
[yijie-163:209559] [ 5] /home/ubuntu/.cache/jittor/jt1.3.7/g++9.4.0/py3.10.11/Linux-5.15.0-5xbb/IntelRXeonRCPUxd5/default/cu11.2.152_sm_70/jittor_core.cpython-310-x86_64-linux-gnu.so(+0x247f84)[0x7f7bfaa5cf84]
[yijie-163:209559] [ 6] /lib/x86_64-linux-gnu/libc.so.6(+0x468a7)[0x7f7c023c58a7]
[yijie-163:209559] [ 7] /lib/x86_64-linux-gnu/libc.so.6(on_exit+0x0)[0x7f7c023c5a60]
[yijie-163:209559] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa)[0x7f7c023a308a]
[yijie-163:209559] [ 9] python[0x5854ee]
[yijie-163:209559] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node yijie-163 exited on signal 6 (Aborted).

Minimal Reproduce

CUDA_VISIBLE_DEVICES="2,3" mpirun -np 2 python -m jittor.test.test_resnet
OS: Ubuntu 20.04.02 LTS
GPU: TITAN V
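
For context on the environment: in the log above, both ranks stop at the "installing nccl..." step with `/bin/sh: 1: make: not found`, which suggests that `make` is not on PATH for the shell that Jittor spawns. The snippet below is a minimal sketch for checking this before launching `mpirun`. It is not part of Jittor; the helper name `missing_tools` and the tool list are assumptions taken from the commands that appear in the log.

```python
import shutil

# Tools the log shows Jittor locating or invoking while it builds NCCL.
# This list is an assumption based on the log, not an official requirement list.
REQUIRED_TOOLS = ["make", "g++", "mpicc"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    else:
        print("All listed build tools were found on PATH.")
```

If `make` is reported missing, on Ubuntu it is normally provided by the `make` package (also pulled in by `build-essential`). Whether installing it is enough to get the NCCL build to complete has not been verified here.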

Expected behavior

Training runs normally.