code=222( cudaErrorUnsupportedPtxVersion ) cudaMallocHost(&ptr, size)

greatofdream · December 12, 2021, 16:20

jittor:1.3.1.22
NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2
GPU:K80
The machine also has an A100, which works normally.
Whenever I try to put data onto the K80 GPU, I get an error.

>>> jt.display_memory_info()                                                                                                                                                          
[i 1212 14:19:58.495313 76 <stdin>:1]                                                                                                                                                 
=== display_memory_info ===                                                                                                                                                           
 total_cpu_ram:  1008GB total_cuda_ram: 11.92GB                                                                                                                                       
 hold_vars: 0 lived_vars: 0 lived_ops: 0                                                                                                                                              
 update queue: 0/0                                                                                                                                                                    
 name: sfrl is_cuda: 1 used:     0 B(-nan%) unused:     0 B(-nan%) total:     0 B                                                                                                     
 name: sfrl is_cuda: 0 used:     0 B(-nan%) unused:     0 B(-nan%) total:     0 B                                                                                                     
 name: sfrl is_cuda: 0 used:     0 B(-nan%) unused:     0 B(-nan%) total:     0 B                                                                                                     
 cpu&gpu:     0 B gpu:     0 B cpu:     0 B                                                                                                                                           
 free: cpu(671.4GB) gpu(    0 B)                                                                                                                                                      
===========================
>>> var_data = jt.array(data)
Traceback (most recent call last):                                                         
  File "<stdin>", line 1, in <module>                                                      
  File "/home/greatofdream/.local/lib/python3.8/site-packages/jittor/__init__.py", line 315, in array
    return ops.array(data)                   
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.ops.array)).                                                                                                   

Types of your inputs are:                    
 self   = module,                            
 args   = (list, ),                          

The function declarations are:               
 VarHolder* array__(PyObject* obj)                                                         

Failed reason:[f 1212 14:23:36.781898 76 helper_cuda.h:126] CUDA error at /home/greatofdream/.local/lib/python3.8/site-packages/jittor/src/mem/allocator/cuda_host_allocator.cc:22  code=222( cudaErrorUnsupportedPtxVersion ) cudaMallocHost(&ptr, size)

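I suspect the cause may be that the K80's compute capability is sm_37, while the default ptxas target in CUDA 11 is sm_52. Does the target need to be specified explicitly for dynamic compilation?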
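For triage, the failure line in the traceback above bundles the numeric code, the symbolic error name, and the failing call. A minimal sketch for pulling those three fields out of a log line, assuming the terminal-wrapped line has been joined back into one; the helper name `parse_cuda_error` is hypothetical and not part of Jittor:

```python
import re

# Matches the "code=<number>( <name> ) <call>" fragment printed by helper_cuda.h
# in the log above. parse_cuda_error is a hypothetical helper, not a Jittor API.
_CUDA_ERROR = re.compile(r"code=(\d+)\(\s*(\w+)\s*\)\s*(.+)$")

def parse_cuda_error(line):
    """Return (code, name, call) from a CUDA failure line, or None if absent."""
    match = _CUDA_ERROR.search(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2), match.group(3).strip()

line = ("CUDA error at cuda_host_allocator.cc:22  "
        "code=222( cudaErrorUnsupportedPtxVersion ) cudaMallocHost(&ptr, size)")
print(parse_cuda_error(line))
```

Having the symbolic name on its own makes it easier to search for `cudaErrorUnsupportedPtxVersion` in the CUDA runtime documentation.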

lzhengning · December 15, 2021, 03:07

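Jittor currently accounts for machines that mix different GPU models: at compile time it checks the compute capability supported by each card and selects the most suitable sm for the card being used. Such hardware setups are rare, however, so we have only tested a limited number of cases.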

We have one machine with 3 RTX TITAN cards (sm_75) and 1 V100 (sm_70) attached. Running the tests on it gives the following output:

lzhengning:~$ python3.7 -m jittor.test.test_cuda
[i 1215 10:33:40.424671 52 compiler.py:944] Jittor(1.3.1.23) src: /home/lzhengning/.local/lib/python3.7/site-packages/jittor
[i 1215 10:33:40.427895 52 compiler.py:945] g++ at /usr/bin/g++(7.5.0)
[i 1215 10:33:40.427992 52 compiler.py:946] cache_path: /home/lzhengning/.cache/jittor/jt1.3.1/g++7.5.0/py3.7.7/Linux-4.15.0-1x77/IntelRXeonRCPUx00/default
[i 1215 10:33:40.431651 52 __init__.py:372] Found nvcc(10.2.89) at /usr/local/cuda/bin/nvcc.
[i 1215 10:33:40.483054 52 __init__.py:372] Found gdb(8.1.0) at /usr/bin/gdb.
[i 1215 10:33:40.488534 52 __init__.py:372] Found addr2line(2.30) at /usr/bin/addr2line.
[i 1215 10:33:42.655058 52 compiler.py:997] cuda key:cu10.2.89_sm_70_75
[i 1215 10:33:42.881166 52 __init__.py:187] Total mem: 62.78GB, using 16 procs for compiling.
[i 1215 10:33:42.983332 52 jit_compiler.cc:27] Load cc_path: /usr/bin/g++
[i 1215 10:33:44.740691 52 init.cc:61] Found cuda archs: [70,75,]
[i 1215 10:33:44.931357 52 __init__.py:372] Found mpicc(2.1.1) at /usr/bin/mpicc.
[i 1215 10:33:45.077898 52 compile_extern.py:29] found /usr/local/cuda/include/cublas.h
[i 1215 10:33:45.089557 52 compile_extern.py:29] found /usr/lib/x86_64-linux-gnu/libcublas.so
[i 1215 10:33:45.089759 52 compile_extern.py:29] found /usr/lib/x86_64-linux-gnu/libcublasLt.so.10
[i 1215 10:33:45.384640 52 compile_extern.py:29] found /usr/local/cuda/include/cudnn.h
[i 1215 10:33:45.402908 52 compile_extern.py:29] found /usr/local/cuda/lib64/libcudnn.so
[i 1215 10:33:46.635188 52 compile_extern.py:29] found /usr/local/cuda/include/curand.h
[i 1215 10:33:46.680405 52 compile_extern.py:29] found /usr/local/cuda/lib64/libcurand.so
[i 1215 10:33:46.758951 52 cuda_flags.cc:32] CUDA enabled.
.[i 1215 10:33:46.781616 52 cuda_flags.cc:32] CUDA enabled.
.[i 1215 10:33:46.785080 52 cuda_flags.cc:32] CUDA enabled.

The key line is Found cuda archs: [70,75,], which shows that Jittor found the correct sm values. What do you see on your side?
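To compare this line across machines, it can be extracted from a captured startup log. The sketch below relies only on the log format shown in this thread; `found_cuda_archs` is a hypothetical helper name:

```python
import re

def found_cuda_archs(log_text):
    """Return the arch list from the first 'Found cuda archs: [...]' line, else []."""
    match = re.search(r"Found cuda archs: \[([\d,\s]*)\]", log_text)
    if match is None:
        return []
    # The log prints a trailing comma, so empty tokens are skipped.
    return [int(tok) for tok in match.group(1).split(",") if tok.strip()]

log = "[i 1215 10:33:44.740691 52 init.cc:61] Found cuda archs: [70,75,]"
print(found_cuda_archs(log))
```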

greatofdream · December 15, 2021, 03:52

I selected the K80. Under the current CUDA 11.2, Jittor does find sm_37:

$ JT_SYNC=1 trace_py_var=3 CUDA_VISIBLE_DEVICES="1" python3 jittortest.py
[i 1214 13:51:05.415760 76 compiler.py:944] Jittor(1.3.1.22) src: /home/greatofdream/.local/lib/python3.8/site-packages/jittor
[i 1214 13:51:05.420535 76 compiler.py:945] g++ at /opt/gentoo/opt/cuda/bin/g++(8.4.0)
[i 1214 13:51:05.420675 76 compiler.py:946] cache_path: /home/greatofdream/.cache/jittor/jt1.3.1/g++8.4.0/py3.8.11/Linux-5.10.0-9xc9/AMDEPYC770264-x23/default
[i 1214 13:51:05.426728 76 __init__.py:372] Found nvcc(11.2.152) at /opt/gentoo/opt/cuda/bin/nvcc.
[i 1214 13:51:05.533214 76 __init__.py:372] Found gdb(10.2) at /opt/gentoo/usr/bin/gdb.
[i 1214 13:51:05.538406 76 __init__.py:372] Found addr2line(2.37) at /opt/gentoo/usr/bin/addr2line.
[i 1214 13:51:07.403562 76 compiler.py:997] cuda key:cu11.2.152_sm_37_80
[i 1214 13:51:07.954068 76 __init__.py:187] Total mem: 1007.75GB, using 16 procs for compiling.
[i 1214 13:51:08.065866 76 jit_compiler.cc:27] Load cc_path: /opt/gentoo/opt/cuda/bin/g++
[i 1214 13:51:09.734541 76 py_var_tracer.cc:22] Load trace_py_var: 3
[i 1214 13:51:09.735341 76 init.cc:61] Found cuda archs: [37,]
[i 1214 13:51:09.824050 76 __init__.py:372] Found mpicc(4.0.4) at /opt/gentoo/usr/bin/mpicc.
[i 1214 13:51:10.031049 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/include/cublas.h
[i 1214 13:51:10.042530 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcublas.so
[i 1214 13:51:10.042772 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcublasLt.so.11
[i 1214 13:51:10.248880 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/include/cudnn.h
[i 1214 13:51:10.269499 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcudnn.so.8
[i 1214 13:51:10.269706 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcudnn_ops_infer.so.8
[i 1214 13:51:10.275331 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcudnn_ops_train.so.8
[i 1214 13:51:10.275903 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcudnn_cnn_infer.so.8
[i 1214 13:51:10.337467 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcudnn_cnn_train.so.8
[i 1214 13:51:10.778860 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/include/curand.h
[i 1214 13:51:10.805477 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcurand.so
[i 1214 13:51:10.951753 76 cuda_flags.cc:32] CUDA enabled.

I queried this card's compute capability with NVIDIA's deviceQuery, and it is indeed sm_37:

Device 1: "Tesla K80"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    3.7
  Total amount of global memory:                 12207 MBytes (12799574016 bytes)
  (013) Multiprocessors, (192) CUDA Cores/MP:    2496 CUDA Cores
  GPU Max Clock rate:                            824 MHz (0.82 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        114688 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
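The capability reported by deviceQuery can be turned into the `sm_XY` form that appears in the Jittor log, which makes the two easy to compare. A minimal sketch, assuming the deviceQuery text layout shown above; `sm_from_device_query` is a hypothetical helper name:

```python
import re

def sm_from_device_query(text):
    """Map each (index, name) device header to its sm_XY string."""
    result = {}
    current = None
    for line in text.splitlines():
        header = re.match(r'\s*Device (\d+): "(.+)"', line)
        if header:
            current = (int(header.group(1)), header.group(2))
            continue
        cap = re.search(
            r"CUDA Capability Major/Minor version number:\s*(\d+)\.(\d+)", line)
        if cap and current is not None:
            result[current] = f"sm_{cap.group(1)}{cap.group(2)}"
    return result

sample = '''Device 1: "Tesla K80"
  CUDA Driver Version / Runtime Version          11.2 / 11.2
  CUDA Capability Major/Minor version number:    3.7
'''
print(sm_from_device_query(sample))
```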

I am currently trying to locate the part of Jittor that compiles CUDA code dynamically. Could the bug be there?

lzhengning · December 15, 2021, 05:17

Does your code fail when it runs on the K80 alone, as a single card?

greatofdream · December 15, 2021, 06:37

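Currently, selecting the A100 with

CUDA_VISIBLE_DEVICES="0"

works, while selecting the K80 with CUDA_VISIBLE_DEVICES="1" does not. It may still be related to the CUDA version: although the CUDA 11.2 documentation says sm_37 is supported, there may be some latent issues.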
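The single-card comparison can be scripted so that both cards run the same script under identical conditions. This is a sketch only: the index assignment (0 for the A100, 1 for the K80) follows the commands in this thread, and `env_for_gpu` is a hypothetical helper name:

```python
import os

def env_for_gpu(index, base_env=None):
    """Copy an environment and restrict CUDA to one physical GPU index."""
    env = dict(os.environ if base_env is None else base_env)
    # The variable must be set before the CUDA runtime initializes in the child.
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env

# Indices as used in this thread: 0 is the A100, 1 is the K80.
a100_env = env_for_gpu(0, base_env={})
k80_env = env_for_gpu(1, base_env={})
print(a100_env, k80_env)
```

In practice the result would be passed to a child process, for example `subprocess.run(["python3", "jittortest.py"], env=env_for_gpu(1))`. Inside that process the selected card is renumbered and shows up as device 0.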

greatofdream · December 15, 2021, 07:53

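A CUDA program compiled by nvcc with -arch=sm_37 does run on the K80. The program I tried is 6_sum_matrix from CUDA_Freshman. I manually set set(CMAKE_CUDA_FLAGS "-arch=sm_37 -g -G -O3") in CMakeLists.txt, and make then invoked: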

/opt/gentoo/opt/cuda/bin/nvcc -forward-unknown-to-host-compiler  -I/home/greatofdream/jittor/CUDA_Freshman/./include -arch=sm_37 -g -G -O3 -std=c++17 -MD -MT 6_sum_matrix/CMakeFiles/sum_matrix.dir/sum_matrix.cu.o -MF CMakeFiles/sum_matrix.dir/sum_matrix.cu.o.d -x cu -c /home/greatofdream/jittor/CUDA_Freshman/6_sum_matrix/sum_matrix.cu -o CMakeFiles/sum_matrix.dir/sum_matrix.cu.o

This confirms that nvcc was indeed given sm_37. Running the program prints:

 CUDA_VISIBLE_DEVICES="1" ./6_sum_matrix/sum_matrix
strating...
Using device 0: Tesla K80
CPU Execution Time elapsed 0.015428 sec
GPU Execution configuration<<<(128,128),(32,32)>>> Time elapsed 0.003202 sec
Check result success!
GPU Execution configuration<<<(524288,1),(32,1)>>> Time elapsed 0.011131 sec
Check result success!
GPU Execution configuration<<<(128,4096),(32,1)>>> Time elapsed 0.011130 sec
Check result success!
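For a machine with both cards, the standalone test could be extended to target each architecture explicitly instead of a single `-arch`. The sketch below builds one nvcc `-gencode` option per compute capability. It is an illustration only and makes no claim about how Jittor assembles its own flags; the arch list [37, 80] is taken from the `cuda key:cu11.2.152_sm_37_80` log line, and `gencode_flags` is a hypothetical helper name:

```python
def gencode_flags(archs):
    """Build one nvcc -gencode option per compute capability, e.g. 37 -> sm_37."""
    flags = []
    for arch in sorted(set(archs)):
        flags.append(f"-gencode=arch=compute_{arch},code=sm_{arch}")
    return flags

print(" ".join(gencode_flags([37, 80])))
```

The resulting string could replace `-arch=sm_37` in `CMAKE_CUDA_FLAGS` so that one binary carries machine code for both the K80 and the A100.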