1. Problem description:
After installing per the setup guide in [VeRL For Pytorch](https://gitee.com/ascend/ModelZoo-PyTorch/tree/master/PyTorch/built-in/rl/VeRL_for_PyTorch), running the GRPO test script reproducibly fails during warmup with a "Failed to initialize the HCCP process" error:
```
...
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.ref_init_model() (pid=324545, ip=10.0.1.3, actor_id=49a34dc51198d5e78bac638c01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0xffcfccff1bd0>)
File "/data/verl/verl/single_controller/ray/base.py", line 663, in func
(TaskRunner pid=314652) [validate_config] All configuration checks passed successfully!
(TaskRunner pid=314652) DeprecationWarning: `ray.state.available_resources_per_node` is a private attribute and access will be removed in a future Ray version.
(TaskRunner pid=314652) Size of train dataloader: 58, Size of val dataloader: 1
(TaskRunner pid=314652) Total training steps: 58
(TaskRunner pid=314652) colocated worker base class <class 'verl.single_controller.base.worker.Worker'>
(TaskRunner pid=314652) WARNING:2025-06-25 04:36:00,838:Waiting for register center actor vx7Ymm_register_center to be ready. Elapsed time: 0 seconds out of 300 seconds.
(WorkerDict pid=324767) Skipping monkey patch for Qwen2ForCausalLM as use_fused_kernels is False or fused_kernels_backend is torch
(WorkerDict pid=324545) Model config after override: Qwen2Config {
(WorkerDict pid=324771) Skipping monkey patch for Qwen2ForCausalLM as use_fused_kernels is False or fused_kernels_backend is torch [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=324545) Skipping monkey patch for Qwen2ForCausalLM as use_fused_kernels is False or fused_kernels_backend is torch
```
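Since the failure happens inside `ref_init_model` before any training step runs, it can help to rule out an environment-level HCCL problem independent of VeRL and Ray. Below is a minimal sketch (assuming `torch` and `torch_npu` from the guide above are installed; the `MASTER_ADDR`/`MASTER_PORT` values are placeholders) that initializes a single-rank HCCL process group directly:

```python
# Minimal single-rank HCCL init check, independent of VeRL and Ray.
# Assumes torch/torch_npu are installed per the VeRL For PyTorch guide;
# MASTER_ADDR/MASTER_PORT below are placeholder rendezvous values.
import os
import torch
import torch.distributed as dist
import torch_npu  # noqa: F401  # registers the NPU ("hccl") backend with torch

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

torch.npu.set_device(0)
dist.init_process_group(backend="hccl", rank=0, world_size=1)
print("HCCL process group initialized:", dist.is_initialized())
dist.destroy_process_group()
```

If this standalone check fails with the same HCCP message, the issue likely sits in the driver/CANN/HCCL stack rather than in VeRL's Ray workers.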
6. Other notes:
- Already tried `pkill -9 python`, but the same error still reproduces;
- `npu-smi info` detects the devices normally (see the per-device allocation check sketched below)
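Note that `npu-smi info` listing the devices does not by itself guarantee they are allocatable; a process killed with `pkill -9` may, in some cases, still hold device resources until the driver releases them. A quick per-device allocation check like the hedged sketch below (again assuming `torch_npu` is installed) can help narrow this down:

```python
# Hedged sketch: run a small tensor op on every visible NPU to confirm each
# device can actually be acquired, not merely listed by npu-smi.
import torch
import torch_npu  # noqa: F401  # registers the NPU backend with torch

for i in range(torch.npu.device_count()):
    torch.npu.set_device(i)
    x = torch.ones(2, 2, device=f"npu:{i}")
    print(f"npu:{i} usable, checksum={x.sum().item()}")
```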