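Error when reproducing HLA-GCN, the image aesthetics assessment method based on a hierarchical layout-aware graph convolutional network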

I am training the model on the AADB dataset, and training fails with the following error:
[i 0616 08:08:25.320000 52 cuda_flags.cc:32] CUDA enabled.
@ [Training Model] Arch = [resnet50_HLAGCN]; Dataset = [aadb]
@ LR = [0.01]; Total epoch = [20]; Batch size = [8]
@ weight_decay = [0.0001]; momentum = [0.9]; workers = [8]
@ model save dir: results/resnet50_HLAGCN_v0_aadb
@ model save period: 4
Preprocessing dataset…
AADB dataset info preloaded in ./preprocess/: #8435 train #498 val #996 test
Preloading file saved!
[w 0616 08:10:07.622000 52 __init__.py:1118] load parameter fc.weight failed …
[w 0616 08:10:07.625000 52 __init__.py:1118] load parameter fc.bias failed …
[w 0616 08:10:07.627000 52 __init__.py:1136] load total 267 params, 2 failed
=> Start training #Ep 1 /20
08:10:07->Ep:[1][ 0/8435] - Net:59.5 - Load:56.1 - loss_avg:nan

Compiling Operators(52/52) used: 2.02s eta: 0s
08:10:07->Ep:[1][ 500/8435] - Net:0.5 - Load:0.1 - loss_avg:nan
08:10:07->Ep:[1][1000/8435] - Net:0.5 - Load:0.1 - loss_avg:nan
--> Train: 8.26 min/epoch, train loss: nan - lr: 0.01000
Traceback (most recent call last):
  File "D:\Python\SetUpPath\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\Python\SetUpPath\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Python\Python_Demo\图像评分模型\hlagcn-jittor-main\utils_jittor\train_jittor.py", line 237, in <module>
    main()
  File "D:\Python\Python_Demo\图像评分模型\hlagcn-jittor-main\utils_jittor\train_jittor.py", line 40, in main
    main_worker(args)
  File "D:\Python\Python_Demo\图像评分模型\hlagcn-jittor-main\utils_jittor\train_jittor.py", line 150, in main_worker
    val_loss, val_acc_aes = val_test_process(val_loader, model, criterions, args)
  File "D:\Python\SetUpPath\lib\site-packages\jittor\__init__.py", line 291, in inner
    ret = func(*args, **kw)
  File "D:\Python\Python_Demo\图像评分模型\hlagcn-jittor-main\utils_jittor\train_jittor.py", line 224, in val_test_process
    metrics = cal_metrics(scores_hist, labels_hist, args.bins)
  File "D:\Python\Python_Demo\图像评分模型\hlagcn-jittor-main\utils_jittor\util.py", line 143, in cal_metrics
    plcc, _ = pearsonr(scores_mean, labels_mean)
  File "D:\Python\SetUpPath\lib\site-packages\scipy\stats\_stats_py.py", line 4090, in pearsonr
    normxm = linalg.norm(xm)
  File "D:\Python\SetUpPath\lib\site-packages\scipy\linalg\_misc.py", line 145, in norm
    a = np.asarray_chkfinite(a)
  File "D:\Python\SetUpPath\lib\site-packages\numpy\lib\function_base.py", line 603, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

I can't tell which part is going wrong.

Hello, the exception says that a NumPy array contains inf or NaN. As a first step, check where the NaN comes from.
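One way to narrow this down is to summarize the two arrays right before the `pearsonr(scores_mean, labels_mean)` call shown in the traceback. The sketch below is only an illustration: the helper name `summarize_non_finite` is made up here and is not part of the repository, and it assumes the inputs are flat sequences of numbers (a 1-D NumPy array also works, since its elements convert with `float`).

```python
import math


def summarize_non_finite(name, values):
    """Count NaN and inf entries in a flat sequence and record the first bad index.

    Hypothetical helper for debugging; not part of the HLA-GCN code.
    """
    total = 0
    nan_count = 0
    inf_count = 0
    first_bad = None
    for i, v in enumerate(values):
        total += 1
        v = float(v)
        if math.isnan(v):
            nan_count += 1
        elif math.isinf(v):
            inf_count += 1
        else:
            continue  # finite value, nothing to record
        if first_bad is None:
            first_bad = i
    return {
        "name": name,
        "total": total,
        "nan": nan_count,
        "inf": inf_count,
        "first_bad": first_bad,
    }


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(summarize_non_finite("scores_mean", [0.5, float("nan"), 0.1]))
    print(summarize_non_finite("labels_mean", [0.4, 0.6, 0.2]))
```

If `labels_mean` is clean and `scores_mean` is entirely NaN, the problem is in the model output rather than in the dataset labels.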

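Hello, I stepped through the code and found that when the data is fed into the model, the model returns a tuple. Element 0 consists of finite floating-point values, while elements 1 and 2 are all NaN. What puzzles me is that the downstream code takes the third element, which is entirely NaN, rather than the first.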
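The manual check described above can be automated with a small sketch like the following. It assumes each element of the returned tuple has first been converted to nested Python lists (for example through a NumPy conversion followed by `.tolist()`, if the outputs support that); the function names are hypothetical and not taken from the repository.

```python
import math


def has_non_finite(x):
    """Return True if a number, or any number inside nested lists/tuples, is NaN or inf."""
    if isinstance(x, (list, tuple)):
        return any(has_non_finite(e) for e in x)
    return not math.isfinite(float(x))


def non_finite_outputs(outputs):
    """Return the indices of the output elements that contain NaN or inf."""
    return [i for i, out in enumerate(outputs) if has_non_finite(out)]


if __name__ == "__main__":
    nan = float("nan")
    # Made-up stand-in for a three-element model output.
    outputs = ([[0.1, 0.9]], [[nan, nan]], [[nan, nan]])
    print(non_finite_outputs(outputs))
```

Running this on the first batch shows at a glance which branches of the model already produce NaN, which helps decide where in the forward pass to look next.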


Then I looked at the model's output.

Now I understand why the third element is used. What I still don't understand is why o2 comes out as NaN.

Then NaN must already be appearing during the network's forward computation.

Looking at this part of your run output:

=> Start training #Ep 1 /20
08:10:07->Ep:[1][ 0/8435] - Net:59.5 - Load:56.1 - loss_avg:nan

it looks like NaN appears in the very first iteration. Before that, there were also warnings that some parameters failed to load.

Could the pretrained parameters have been loaded incorrectly? It may be worth checking with the code's author first.
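The log names `fc.weight` and `fc.bias` as the two parameters that failed to load. To see why a parameter fails, one option is to compare parameter names and shapes between the model and the checkpoint. The sketch below assumes you can build a name-to-shape mapping for each side; `diff_parameters` is a hypothetical helper and the shapes in the example are made up, not read from the actual checkpoint.

```python
def diff_parameters(model_shapes, checkpoint_shapes):
    """Compare two name -> shape mappings and report where they disagree.

    Hypothetical debugging helper; both arguments are plain dicts.
    """
    missing = sorted(n for n in model_shapes if n not in checkpoint_shapes)
    unexpected = sorted(n for n in checkpoint_shapes if n not in model_shapes)
    mismatched = sorted(
        n
        for n in model_shapes
        if n in checkpoint_shapes
        and tuple(model_shapes[n]) != tuple(checkpoint_shapes[n])
    )
    return {
        "missing_in_checkpoint": missing,
        "unexpected_in_checkpoint": unexpected,
        "shape_mismatch": mismatched,
    }


if __name__ == "__main__":
    # Made-up shapes for illustration only.
    model = {"conv1.weight": (64, 3, 7, 7), "fc.weight": (10, 2048), "fc.bias": (10,)}
    checkpoint = {"conv1.weight": (64, 3, 7, 7), "fc.weight": (1000, 2048), "fc.bias": (1000,)}
    print(diff_parameters(model, checkpoint))
```

If only the final `fc` layer differs in shape, the warnings may simply reflect a replaced classification head, and the NaN would then have to come from somewhere else.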

This is the paper I want to reproduce, but it gives no contact information.

How about opening an issue on GitHub?

I opened one eight hours ago, but there has been no response.

Jittor has a mechanism for checking for NaN. You can refer to the following documentation to find the code where NaN first appears:

https://cg.cs.tsinghua.edu.cn/jittor/assets/docs/Jittor调试技巧.html#naninf