jittor:1.3.1.22
NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2
GPU: K80
There is also an A100 in the same machine that works normally.
As soon as I try to put data onto the K80, I get an error:
>>> jt.display_memory_info()
[i 1212 14:19:58.495313 76 <stdin>:1]
=== display_memory_info ===
total_cpu_ram: 1008GB total_cuda_ram: 11.92GB
hold_vars: 0 lived_vars: 0 lived_ops: 0
update queue: 0/0
name: sfrl is_cuda: 1 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_cuda: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
name: sfrl is_cuda: 0 used: 0 B(-nan%) unused: 0 B(-nan%) total: 0 B
cpu&gpu: 0 B gpu: 0 B cpu: 0 B
free: cpu(671.4GB) gpu( 0 B)
===========================
>>> var_data = jt.array(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/greatofdream/.local/lib/python3.8/site-packages/jittor/__init__.py", line 315, in array
return ops.array(data)
RuntimeError: Wrong inputs arguments, Please refer to examples(help(jt.ops.array)).
Types of your inputs are:
self = module,
args = (list, ),
The function declarations are:
VarHolder* array__(PyObject* obj)
Failed reason:[f 1212 14:23:36.781898 76 helper_cuda.h:126] CUDA error at /home/greatofdream/.local/lib/python3.8/site-packages/jittor/src/mem/allocator/cuda_host_allocator.cc:22 code=222( cudaErrorUnsupportedPtxVersion ) cudaMallocHost(&ptr, size)
My guess: is it because the K80's compute capability is sm_37, while ptxas in CUDA 11 defaults to sm_52? Does the arch need to be specified explicitly when Jittor JIT-compiles its kernels?
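For reference, a minimal sketch of the kind of script that triggers this for me (assuming the K80 is exposed as device 1 and the A100 as device 0; the same script works when the A100 is selected):

import jittor as jt

# Minimal repro sketch. Run with:  CUDA_VISIBLE_DEVICES="1" python3 repro.py
jt.flags.use_cuda = 1          # force the CUDA backend
jt.display_memory_info()       # same call as in the session above

data = [1.0, 2.0, 3.0, 4.0]    # any small host-side list
var_data = jt.array(data)      # fails here on the K80 with cudaErrorUnsupportedPtxVersion
print(var_data.sum())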
Jittor does handle machines with different GPU models in one box: at compile time, Jittor checks the compute capability of every GPU and picks the most suitable sm for the GPU it is currently running on. However, such hardware setups are rare, and we have only been able to test them in a limited number of configurations.
We have one machine with 3 RTX TITAN cards (sm 75) and 1 V100 (sm 70) attached at the same time; the test results are as follows:
lzhengning:~$ python3.7 -m jittor.test.test_cuda
[i 1215 10:33:40.424671 52 compiler.py:944] Jittor(1.3.1.23) src: /home/lzhengning/.local/lib/python3.7/site-packages/jittor
[i 1215 10:33:40.427895 52 compiler.py:945] g++ at /usr/bin/g++(7.5.0)
[i 1215 10:33:40.427992 52 compiler.py:946] cache_path: /home/lzhengning/.cache/jittor/jt1.3.1/g++7.5.0/py3.7.7/Linux-4.15.0-1x77/IntelRXeonRCPUx00/default
[i 1215 10:33:40.431651 52 __init__.py:372] Found nvcc(10.2.89) at /usr/local/cuda/bin/nvcc.
[i 1215 10:33:40.483054 52 __init__.py:372] Found gdb(8.1.0) at /usr/bin/gdb.
[i 1215 10:33:40.488534 52 __init__.py:372] Found addr2line(2.30) at /usr/bin/addr2line.
[i 1215 10:33:42.655058 52 compiler.py:997] cuda key:cu10.2.89_sm_70_75
[i 1215 10:33:42.881166 52 __init__.py:187] Total mem: 62.78GB, using 16 procs for compiling.
[i 1215 10:33:42.983332 52 jit_compiler.cc:27] Load cc_path: /usr/bin/g++
[i 1215 10:33:44.740691 52 init.cc:61] Found cuda archs: [70,75,]
[i 1215 10:33:44.931357 52 __init__.py:372] Found mpicc(2.1.1) at /usr/bin/mpicc.
[i 1215 10:33:45.077898 52 compile_extern.py:29] found /usr/local/cuda/include/cublas.h
[i 1215 10:33:45.089557 52 compile_extern.py:29] found /usr/lib/x86_64-linux-gnu/libcublas.so
[i 1215 10:33:45.089759 52 compile_extern.py:29] found /usr/lib/x86_64-linux-gnu/libcublasLt.so.10
[i 1215 10:33:45.384640 52 compile_extern.py:29] found /usr/local/cuda/include/cudnn.h
[i 1215 10:33:45.402908 52 compile_extern.py:29] found /usr/local/cuda/lib64/libcudnn.so
[i 1215 10:33:46.635188 52 compile_extern.py:29] found /usr/local/cuda/include/curand.h
[i 1215 10:33:46.680405 52 compile_extern.py:29] found /usr/local/cuda/lib64/libcurand.so
[i 1215 10:33:46.758951 52 cuda_flags.cc:32] CUDA enabled.
.[i 1215 10:33:46.781616 52 cuda_flags.cc:32] CUDA enabled.
.[i 1215 10:33:46.785080 52 cuda_flags.cc:32] CUDA enabled.
The key line is Found cuda archs: [70,75,], which shows that Jittor found the correct sm values. What does this look like on your side?
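If you want to cross-check what the CUDA runtime itself reports for each visible GPU, independent of Jittor, something along these lines works as a quick illustration (the compute-capability attribute enum values 75/76 are taken from the CUDA 11 headers, and the libcudart soname may differ on your install, e.g. libcudart.so.11.0):

import ctypes

# Illustrative sketch: query each visible GPU's compute capability straight
# from the CUDA runtime, bypassing Jittor. Respects CUDA_VISIBLE_DEVICES.
libcudart = ctypes.CDLL("libcudart.so")

# cudaDevAttrComputeCapabilityMajor / Minor (enum values per the CUDA 11 headers)
CC_MAJOR, CC_MINOR = 75, 76

count = ctypes.c_int()
assert libcudart.cudaGetDeviceCount(ctypes.byref(count)) == 0

for dev in range(count.value):
    major, minor = ctypes.c_int(), ctypes.c_int()
    libcudart.cudaDeviceGetAttribute(ctypes.byref(major), CC_MAJOR, dev)
    libcudart.cudaDeviceGetAttribute(ctypes.byref(minor), CC_MINOR, dev)
    print(f"device {dev}: sm_{major.value}{minor.value}")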
I selected the K80 explicitly, and under the current CUDA 11.2 Jittor does find sm_37:
$ JT_SYNC=1 trace_py_var=3 CUDA_VISIBLE_DEVICES="1" python3 jittortest.py
[i 1214 13:51:05.415760 76 compiler.py:944] Jittor(1.3.1.22) src: /home/greatofdream/.local/lib/python3.8/site-packages/jittor
[i 1214 13:51:05.420535 76 compiler.py:945] g++ at /opt/gentoo/opt/cuda/bin/g++(8.4.0)
[i 1214 13:51:05.420675 76 compiler.py:946] cache_path: /home/greatofdream/.cache/jittor/jt1.3.1/g++8.4.0/py3.8.11/Linux-5.10.0-9xc9/AMDEPYC770264-x23/default
[i 1214 13:51:05.426728 76 __init__.py:372] Found nvcc(11.2.152) at /opt/gentoo/opt/cuda/bin/nvcc.
[i 1214 13:51:05.533214 76 __init__.py:372] Found gdb(10.2) at /opt/gentoo/usr/bin/gdb.
[i 1214 13:51:05.538406 76 __init__.py:372] Found addr2line(2.37) at /opt/gentoo/usr/bin/addr2line.
[i 1214 13:51:07.403562 76 compiler.py:997] cuda key:cu11.2.152_sm_37_80
[i 1214 13:51:07.954068 76 __init__.py:187] Total mem: 1007.75GB, using 16 procs for compiling.
[i 1214 13:51:08.065866 76 jit_compiler.cc:27] Load cc_path: /opt/gentoo/opt/cuda/bin/g++
[i 1214 13:51:09.734541 76 py_var_tracer.cc:22] Load trace_py_var: 3
[i 1214 13:51:09.735341 76 init.cc:61] Found cuda archs: [37,]
[i 1214 13:51:09.824050 76 __init__.py:372] Found mpicc(4.0.4) at /opt/gentoo/usr/bin/mpicc.
[i 1214 13:51:10.031049 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/include/cublas.h
[i 1214 13:51:10.042530 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcublas.so
[i 1214 13:51:10.042772 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcublasLt.so.11
[i 1214 13:51:10.248880 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/include/cudnn.h
[i 1214 13:51:10.269499 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcudnn.so.8
[i 1214 13:51:10.269706 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcudnn_ops_infer.so.8
[i 1214 13:51:10.275331 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcudnn_ops_train.so.8
[i 1214 13:51:10.275903 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcudnn_cnn_infer.so.8
[i 1214 13:51:10.337467 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcudnn_cnn_train.so.8
[i 1214 13:51:10.778860 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/include/curand.h
[i 1214 13:51:10.805477 76 compile_extern.py:29] found /opt/gentoo/opt/cuda/lib64/libcurand.so
[i 1214 13:51:10.951753 76 cuda_flags.cc:32] CUDA enabled.
I also queried this card with NVIDIA's deviceQuery sample; its compute capability is indeed 3.7 (sm_37):
Device 1: "Tesla K80"
CUDA Driver Version / Runtime Version 11.2 / 11.2
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 12207 MBytes (12799574016 bytes)
(013) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Max Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 114688 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 131 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
I am now trying to locate the part of Jittor that JIT-compiles the CUDA code; perhaps the bug is there?
For the moment, selecting the A100 with CUDA_VISIBLE_DEVICES="0" works, while selecting the K80 with CUDA_VISIBLE_DEVICES="1" does not. It may still be related to the CUDA version: even though the CUDA 11.2 documentation says sm_37 is supported, there could be some lingering issue.
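To narrow down whether the failure happens in Jittor's JIT-compiled code or already in the raw runtime call, one diagnostic I can try is the failing call itself (cudaMallocHost, code 222 above) through ctypes, outside Jittor entirely; whether this reproduces cudaErrorUnsupportedPtxVersion on the K80 is exactly what it would tell me (diagnostic sketch only; the libcudart soname may differ):

import ctypes

# Diagnostic sketch: call cudaMallocHost directly on the K80, outside Jittor.
# Run with:  CUDA_VISIBLE_DEVICES="1" python3 check_mallochost.py
libcudart = ctypes.CDLL("libcudart.so")
libcudart.cudaGetErrorString.restype = ctypes.c_char_p

ptr = ctypes.c_void_p()
err = libcudart.cudaMallocHost(ctypes.byref(ptr), ctypes.c_size_t(1 << 20))  # 1 MB pinned buffer
print("cudaMallocHost ->", err, libcudart.cudaGetErrorString(err).decode())
if err == 0:
    libcudart.cudaFreeHost(ptr)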
A CUDA program compiled with nvcc and -arch=sm_37 does run on the K80. The program I tried is the 6_sum_matrix sample from CUDA_Freshman; I manually set set(CMAKE_CUDA_FLAGS "-arch=sm_37 -g -G -O3") in CMakeLists.txt, and make shows
/opt/gentoo/opt/cuda/bin/nvcc -forward-unknown-to-host-compiler -I/home/greatofdream/jittor/CUDA_Freshman/./include -arch=sm_37 -g -G -O3 -std=c++17 -MD -MT 6_sum_matrix/CMakeFiles/sum_matrix.dir/sum_matrix.cu.o -MF CMakeFiles/sum_matrix.dir/sum_matrix.cu.o.d -x cu -c /home/greatofdream/jittor/CUDA_Freshman/6_sum_matrix/sum_matrix.cu -o CMakeFiles/sum_matrix.dir/sum_matrix.cu.o
which confirms that nvcc was indeed told to target sm_37. The program runs and outputs:
CUDA_VISIBLE_DEVICES="1" ./6_sum_matrix/sum_matrix
strating...
Using device 0: Tesla K80
CPU Execution Time elapsed 0.015428 sec
GPU Execution configuration<<<(128,128),(32,32)>>> Time elapsed 0.003202 sec
Check result success!
GPU Execution configuration<<<(524288,1),(32,1)>>> Time elapsed 0.011131 sec
Check result success!
GPU Execution configuration<<<(128,4096),(32,1)>>> Time elapsed 0.011130 sec
Check result success!