Jittor's GauGAN reports an error when running distributed training on multiple DCU cards

Sugon DCU server, Python 3.9.12, Jittor 1.3.6.4
The error message is as follows:
Compiling Operators(134/134) used: 2.32s eta: 0s
HIP Warning: kernel (ZN6jittorL4funcEiiiiPfS0_iiiiPiS0) launch (1024) threads out of range (256), add launch_bounds to kernel define or use --gpu-max-threads-per-block recompile program !
[e 0608 06:37:03.944058 96 mem_info.cc:102] appear time → node cnt: {1:35117, }
Traceback (most recent call last):
File "/usr/local/lib/python3.9/runpy.py", line 188, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/local/lib/python3.9/runpy.py", line 111, in _get_module_details
__import__(pkg_name)
File "/work/home/acqy7em2bj/codes/gaugan/spade_train.py", line 52, in <module>
trainer.run_generator_one_step(data_i)
File "/work/home/acqy7em2bj/codes/gaugan/pix2pix_trainer.py", line 31, in run_generator_one_step
self.optimizer_G.backward(g_loss)
File "/work/home/acqy7em2bj/.local/lib/python3.9/site-packages/jittor/optim.py", line 144, in backward
jt.sync(params_has_grad)
RuntimeError: [f 0608 06:37:03.944742 96 executor.cc:668]
Execute fused operator(2087/3781) failed.
[JIT Source]: /work/home/acqy7em2bj/.cache/jittor/jt1.3.6/g++7.3.1/py3.9.12/Linux-3.10.0-9x5d/HygonC86728532xac/default/jit/nccl_all_reduce__Tx_float32__JIT_1__JIT_cuda_1__index_t_int32_hash_800246a282edc36a_op.cc
[OP TYPE]: nccl_all_reduce
[Input]: float32[1024,],
[Output]: float32[1024,],
[Async Backtrace]: —
/usr/local/lib/python3.9/runpy.py:188 <_run_module_as_main>
/usr/local/lib/python3.9/runpy.py:111 <_get_module_details>
<frozen importlib._bootstrap>:1007 <_find_and_load>
<frozen importlib._bootstrap>:986 <_find_and_load_unlocked>
<frozen importlib._bootstrap>:680 <_load_unlocked>
<frozen importlib._bootstrap_external>:850 <exec_module>
<frozen importlib._bootstrap>:228 <_call_with_frames_removed>
/work/home/acqy7em2bj/codes/gaugan/spade_train.py:52 <>
/work/home/acqy7em2bj/codes/gaugan/pix2pix_trainer.py:29 <run_generator_one_step>
/work/home/acqy7em2bj/.local/lib/python3.9/site-packages/jittor/__init__.py:1109 <__call__>
/work/home/acqy7em2bj/codes/gaugan/models/pix2pix_model.py:50
/work/home/acqy7em2bj/codes/gaugan/models/pix2pix_model.py:136 <compute_generator_loss>
/work/home/acqy7em2bj/codes/gaugan/models/pix2pix_model.py:196 <generate_fake>
/work/home/acqy7em2bj/.local/lib/python3.9/site-packages/jittor/__init__.py:1109 <__call__>
/work/home/acqy7em2bj/codes/gaugan/models/networks/generator.py:90
/work/home/acqy7em2bj/.local/lib/python3.9/site-packages/jittor/__init__.py:1109 <__call__>
/work/home/acqy7em2bj/codes/gaugan/models/networks/architecture.py:54
/work/home/acqy7em2bj/.local/lib/python3.9/site-packages/jittor/__init__.py:1109 <__call__>
/work/home/acqy7em2bj/codes/gaugan/models/networks/normalization.py:102
/work/home/acqy7em2bj/.local/lib/python3.9/site-packages/jittor/__init__.py:1109 <__call__>
/work/home/acqy7em2bj/.local/lib/python3.9/site-packages/jittor/nn.py:635
/work/home/acqy7em2bj/.local/lib/python3.9/site-packages/jittor/compile_extern.py:558
[Reason]: [f 0608 06:37:03.919820 96 helper_cuda.h:130] HIP error at /work/home/acqy7em2bj/.cache/jittor/jt1.3.6/g++7.3.1/py3.9.12/Linux-3.10.0-9x5d/HygonC86728532xac/default/jit/nccl_all_reduce__Tx_float32__JIT_1__JIT_cuda_1__index_t_int32_hash_800246a282edc36a_op.cc:45 code=1( unhandled cuda error ) ncclAllReduce(xp, yp, y->num, ncclFloat , ncclSum, comm, 0)
Invalid address access: 0x7f542c76c000, Error code: 1.

KERNEL VMFault !!! <<<<<<

PID: 16849 !!! <<<<<<
=========> STREAM <0x2e81780>: VMFault HSA QUEUE ANALYSIS <=========
=========> STREAM <0x2d22f10>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x2d22f10>: >>>>>>>> DUMP KERNEL AQL PACKET <<<<<<<<<
STREAM <0x2d22f10>: header: 770
STREAM <0x2d22f10>: setup: 3
STREAM <0x2d22f10>: workgroup: x:256, y:1, z:1
STREAM <0x2d22f10>: grid: x:256, y:1, z:1
STREAM <0x2d22f10>: group_segment_size: 4216
STREAM <0x2d22f10>: private_segment_size: 384
STREAM <0x2d22f10>: kernel_object: 139999462824128

SUCCESS: FIND SAME KERNEL OBJECT COMMAND IN USE LIST. useIdx: 122
STREAM <0x2d22f10>: >>>>>>>> FIND MATCH KERNEL COMMAND <<<<<<<<<
STREAM <0x2d22f10>: kernel name: _Z43ncclKernel_AllReduce_RING_SIMPLE_Sum_int8_tP11ncclDevComm12ncclWorkElem
STREAM <0x2d22f10>: >>>>>>>> DUMP KERNEL ARGS: size: 72 <<<<<<<<<

00 60 c0 93 54 7f 00 00 b2 08 01 24 00 00 00 00
00 9a 57 a8 53 7f 00 00 00 ba 57 a8 53 7f 00 00
00 04 00 00 00 00 00 00 00 02 00 00 00 00 00 00
00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00

STREAM <0x2d22f10>: >>>>>>>> DUMP KERNEL ARGS PTR INFO <<<<<<<<<
STREAM <0x2d22f10>: ptr arg index: 2, ptr: 0x7f53a8579a00
STREAM <0x2d22f10>: host ptr: 0x7f53a8500000, device ptr: 0x7f53a8500000, unaligned ptr: 0x7f53a8500000
STREAM <0x2d22f10>: size byte: 1048576
STREAM <0x2d22f10>: ptr arg index: 3, ptr: 0x7f53a857ba00
STREAM <0x2d22f10>: host ptr: 0x7f53a8500000, device ptr: 0x7f53a8500000, unaligned ptr: 0x7f53a8500000
STREAM <0x2d22f10>: size byte: 1048576

=========> STREAM <0x2fe5cd0>: VMFault HSA QUEUE ANALYSIS <=========
=========> STREAM <0x2de5ee0>: VMFault HSA QUEUE ANALYSIS <=========
=========> STREAM <0x2f1d020>: VMFault HSA QUEUE ANALYSIS <=========

KERNEL VMFault Analysis END !!! <<<<<<

[bf292062f43f:16849] *** Process received signal ***
[bf292062f43f:16849] Signal: Aborted (6)
[bf292062f43f:16849] Signal code: (-6)
[bf292062f43f:16849] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f59324ec630]
[bf292062f43f:16849] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f5931a3c387]
[bf292062f43f:16849] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f5931a3da78]
[bf292062f43f:16849] [ 3] /opt/dtk-22.10.1/hip/lib/libgalaxyhip.so.5(+0x975704)[0x7f59238a7704]
[bf292062f43f:16849] [ 4] /opt/dtk-22.10.1/hip/lib/libgalaxyhip.so.5(+0x97402e)[0x7f59238a602e]
[bf292062f43f:16849] [ 5] /opt/dtk-22.10.1/hip/lib/libgalaxyhip.so.5(+0x9381e6)[0x7f592386a1e6]
[bf292062f43f:16849] [ 6] /lib64/libpthread.so.0(+0x7ea5)[0x7f59324e4ea5]
[bf292062f43f:16849] [ 7] /lib64/libc.so.6(clone+0x6d)[0x7f5931b04b0d]
[bf292062f43f:16849] *** End of error message ***

Compiling Operators(134/134) used: 2.31s eta: 0s
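
The HIP warning at the top of the log also looks relevant: a JIT-generated Jittor kernel is launched with 1024 threads while the DCU only allows 256 threads per block, and the warning itself suggests either adding launch_bounds to the kernel or recompiling with --gpu-max-threads-per-block. The kernel named in the warning is generated by Jittor, so it cannot be edited directly; the hand-written HIP kernel below is only a minimal sketch of what the suggested __launch_bounds__ attribute and a 256-thread launch look like (it is not Jittor's code):

#include <hip/hip_runtime.h>

// Illustrative only, not Jittor's generated kernel. __launch_bounds__(256) promises
// the compiler this kernel is never launched with more than 256 threads per block,
// which matches the per-block limit reported in the HIP warning above.
__global__ void __launch_bounds__(256) scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    hipMalloc(&d, n * sizeof(float));
    hipMemset(d, 0, n * sizeof(float));
    // Launch with 256 threads per block to stay inside the hardware limit.
    dim3 block(256), grid((n + 255) / 256);
    hipLaunchKernelGGL(scale_kernel, grid, block, 0, 0, d, 2.0f, n);
    hipDeviceSynchronize();
    hipFree(d);
    return 0;
}

Whether this warning is connected to the later VMFault is unclear, since the crash itself happens inside nccl_all_reduce rather than in that kernel.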

Have you found a solution to this?
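
While waiting for an answer, one way to narrow it down: the fatal error is a VMFault inside ncclAllReduce (RCCL on DCU), so a standalone all-reduce check run outside Jittor would show whether the RCCL/DTK stack itself works across the cards. Below is a minimal single-process sketch; the <rccl.h> header name and a build line such as hipcc test.cpp -lrccl are assumptions about the DTK install, not something taken from the log:

#include <hip/hip_runtime.h>
#include <rccl.h>
#include <cstdio>
#include <vector>

// Standalone RCCL all-reduce check (sketch): one process drives every visible DCU
// and sums a 1024-element float buffer across them, mirroring the failing op's
// shape (float32[1024]) and the call ncclAllReduce(..., ncclFloat, ncclSum, ...).
#define HIP_CHECK(cmd) do { hipError_t e = (cmd); if (e != hipSuccess) { \
    printf("HIP error %d at line %d\n", (int)e, __LINE__); return 1; } } while (0)
#define NCCL_CHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    printf("NCCL error %d at line %d\n", (int)r, __LINE__); return 1; } } while (0)

int main() {
    int ndev = 0;
    HIP_CHECK(hipGetDeviceCount(&ndev));
    const size_t count = 1024;

    std::vector<int> devs(ndev);
    std::vector<ncclComm_t> comms(ndev);
    std::vector<float*> bufs(ndev);
    std::vector<hipStream_t> streams(ndev);
    for (int i = 0; i < ndev; ++i) devs[i] = i;
    NCCL_CHECK(ncclCommInitAll(comms.data(), ndev, devs.data()));

    for (int i = 0; i < ndev; ++i) {
        HIP_CHECK(hipSetDevice(i));
        HIP_CHECK(hipMalloc(&bufs[i], count * sizeof(float)));
        HIP_CHECK(hipMemset(bufs[i], 0, count * sizeof(float)));
        HIP_CHECK(hipStreamCreate(&streams[i]));
    }

    // In-place sum across all devices, like Jittor's nccl_all_reduce op.
    NCCL_CHECK(ncclGroupStart());
    for (int i = 0; i < ndev; ++i)
        NCCL_CHECK(ncclAllReduce(bufs[i], bufs[i], count, ncclFloat, ncclSum,
                                 comms[i], streams[i]));
    NCCL_CHECK(ncclGroupEnd());

    for (int i = 0; i < ndev; ++i) {
        HIP_CHECK(hipSetDevice(i));
        HIP_CHECK(hipStreamSynchronize(streams[i]));
        HIP_CHECK(hipFree(bufs[i]));
        ncclCommDestroy(comms[i]);
    }
    printf("all-reduce across %d device(s) finished without a fault\n", ndev);
    return 0;
}

If this check also VMFaults, the problem is below Jittor (RCCL or the DTK runtime); if it passes, the invalid address is more likely in the buffer Jittor hands to the op, and the 1024-thread launch warning above becomes a better lead.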