
Follow my CSDN: https://spike.blog.csdn.net/
Original article: https://blog.csdn.net/caroline_wendy/article/details/129830379

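Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. Its design draws on Google's years of production experience and on community best practices. It runs on physical machines, virtual machines, and public, private, or hybrid clouds, and provides efficient, flexible, and scalable container management. Its core components are the control plane and the compute nodes, together with supporting services such as an image registry, networking, monitoring, and security. Through abstractions such as pod, service, and replication controller, it implements container grouping, scheduling, load balancing, service discovery, health checking, and self-healing. Kubernetes is currently the most popular container orchestration platform and a cornerstone of cloud-native applications.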


Error:

[1,5]<stderr>:RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
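  • In short: Horovod shut down because one rank raised an exception, or because a collective operation (allreduce, allgather, or broadcast) was attempted after a rank had already finished. This message is therefore only a symptom; if an exception was the cause, it appears in the log before the first shutdown message.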
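
The error message itself says that the real exception is printed earlier in the log. A minimal sketch of that triage step follows, assuming each log line carries the "[job,rank]<stderr>:" prefix seen in this post. The function name first_root_exception is hypothetical and is not part of Horovod.

```python
import re

# Matches the "[job,rank]<stream>:" prefix that the launcher adds to each line.
# The prefix may sit mid-line when a progress bar was printed first, so we search.
RANK_PREFIX = re.compile(r"\[\d+,(\d+)\]<(?:stderr|stdout)>:")

# Exception summary lines are unindented and look like "SomeError: details".
EXCEPTION_LINE = re.compile(r"[A-Za-z_][\w.]*(Error|Exception): ")


def first_root_exception(log_text):
    """Return (rank, message) for the first exception that is not the
    Horovod shutdown notice, or None if there is no such line."""
    for line in log_text.splitlines():
        match = RANK_PREFIX.search(line)
        if match is None:
            continue
        message = line[match.end():]
        if "Horovod has been shut down" in message:
            continue  # symptom on the other ranks, not the cause
        if EXCEPTION_LINE.match(message):
            return int(match.group(1)), message
    return None
```

Applied to the full log in this post, a helper like this would point to the PermissionError on rank 0 rather than to the RuntimeError on rank 5.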

Full log:

Epoch 0: 100% 1250/1250 [5:56:48<00:00, 17.13s/it, loss=93.3][1,0]<stderr>:2023-03-28 18:04:56,682 INFO 53 [train_openfold.py:206] epoch=1, step=1250, avg_loss=112.9477
Epoch 0: 100% 1250/1250 [5:56:49<00:00, 17.13s/it, loss=93.3][1,3]<stderr>:2023-03-28 18:04:56,698 INFO 53 [train_openfold.py:206] epoch=1, step=1250, avg_loss=112.9537
[1,6]<stderr>:2023-03-28 18:04:56,717 INFO 53 [train_openfold.py:206] epoch=1, step=1250, avg_loss=113.49
[1,1]<stderr>:2023-03-28 18:04:56,718 INFO 53 [train_openfold.py:206] epoch=1, step=1250, avg_loss=112.6026
[1,5]<stderr>:2023-03-28 18:04:56,727 INFO 53 [train_openfold.py:206] epoch=1, step=1250, avg_loss=112.2132
[1,7]<stderr>:2023-03-28 18:04:56,729 INFO 53 [train_openfold.py:206] epoch=1, step=1250, avg_loss=113.4736
[1,4]<stderr>:2023-03-28 18:04:56,849 INFO 53 [train_openfold.py:206] epoch=1, step=1250, avg_loss=112.2597
[1,2]<stderr>:2023-03-28 18:04:56,972 INFO 53 [train_openfold.py:206] epoch=1, step=1250, avg_loss=112.6451
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/nfs/xxx/workspace/af2/af2-multimer-ex/bin/train/../../train_openfold.py", line 919, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "/nfs/xxx/workspace/af2/af2-multimer-ex/bin/train/../../train_openfold.py", line 503, in main
[1,0]<stderr>:    trainer.fit(model_module, datamodule=data_module, ckpt_path=ckpt_path)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in fit
[1,0]<stderr>:    self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
[1,0]<stderr>:    return trainer_fn(*args, **kwargs)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
[1,0]<stderr>:    self._run(model, ckpt_path=self.ckpt_path)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
[1,0]<stderr>:    results = self._run_stage()
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
[1,0]<stderr>:    self._run_train()
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
[1,0]<stderr>:    self.fit_loop.run()
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
[1,0]<stderr>:    self.on_advance_end()
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 295, in on_advance_end
[1,0]<stderr>:    self.trainer._call_callback_hooks("on_train_epoch_end")
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1340, in _call_callback_hooks
[1,0]<stderr>:    fn(self, self.lightning_module, *args, **kwargs)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 312, in on_train_epoch_end
[1,0]<stderr>:    self._save_topk_checkpoint(trainer, monitor_candidates)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 371, in _save_topk_checkpoint
[1,0]<stderr>:    self._save_none_monitor_checkpoint(trainer, monitor_candidates)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 662, in _save_none_monitor_checkpoint
[1,0]<stderr>:    self._save_checkpoint(trainer, filepath)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 374, in _save_checkpoint
[1,0]<stderr>:    trainer.save_checkpoint(filepath, self.save_weights_only)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1901, in save_checkpoint
[1,0]<stderr>:    self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 535, in save_checkpoint
[1,0]<stderr>:    self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 466, in save_checkpoint
[1,0]<stderr>:    self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/lightning_lite/plugins/io/torch_io.py", line 51, in save_checkpoint
[1,0]<stderr>:    fs.makedirs(os.path.dirname(path), exist_ok=True)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/fsspec/implementations/local.py", line 47, in makedirs
[1,0]<stderr>:    os.makedirs(path, exist_ok=exist_ok)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/os.py", line 223, in makedirs
[1,0]<stderr>:    mkdir(name, mode)
[1,0]<stderr>:PermissionError: [Errno 13] Permission denied: '/nfs/xxx/workspace/af2/output/checkpoints'
[1,5]<stderr>:Traceback (most recent call last):
[1,5]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/horovod/torch/mpi_ops.py", line 1253, in synchronize
[1,5]<stderr>:    mpi_lib.horovod_torch_wait_and_clear(handle)
[1,5]<stderr>:RuntimeError: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.

Root cause:

[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 466, in save_checkpoint
[1,0]<stderr>:    self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/lightning_lite/plugins/io/torch_io.py", line 51, in save_checkpoint
[1,0]<stderr>:    fs.makedirs(os.path.dirname(path), exist_ok=True)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/site-packages/fsspec/implementations/local.py", line 47, in makedirs
[1,0]<stderr>:    os.makedirs(path, exist_ok=exist_ok)
[1,0]<stderr>:  File "/opt/conda/envs/openfold/lib/python3.7/os.py", line 223, in makedirs
[1,0]<stderr>:    mkdir(name, mode)
[1,0]<stderr>:PermissionError: [Errno 13] Permission denied: '/nfs/xxx/workspace/af2/output/checkpoints'

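In other words, saving the checkpoint failed with a permission error: rank 0 could not create the checkpoints directory because the training process has no write permission on its parent directory, /nfs/xxx/workspace/af2/output. The Horovod shutdown reported on the other ranks is a consequence of that failure.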
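
To confirm this kind of diagnosis before changing any permissions, one option is to find the nearest existing parent of the target path and ask whether the current user may create entries in it. The sketch below uses only the Python standard library. The function names are hypothetical, and os.access gives an advisory answer only, since the file server makes the final decision.

```python
import os


def nearest_existing_ancestor(path):
    """Walk upward from path until a directory that exists is found."""
    current = os.path.abspath(path)
    while not os.path.isdir(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def can_create(path):
    """Return (ancestor, ok): the directory in which os.makedirs would start
    creating entries, and whether the current user appears allowed to."""
    ancestor = nearest_existing_ancestor(path)
    # Creating an entry needs write and search (execute) permission on the parent.
    return ancestor, os.access(ancestor, os.W_OK | os.X_OK)
```

Calling can_create("/nfs/xxx/workspace/af2/output/checkpoints") as the training user shows which directory os.makedirs would start from and whether writing there looks permitted.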

Granting write permission to all users, recursively, resolves it:

cd /nfs/xxx/workspace/af2/output/
chmod -R a+w .
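
In the log above the failure only surfaced at the first checkpoint, after an epoch that took nearly six hours. A pre-flight check run before training starts would surface the same error immediately. This is a minimal sketch, not part of train_openfold.py or PyTorch Lightning; the function name assert_writable_dir is hypothetical.

```python
import os
import tempfile


def assert_writable_dir(path):
    """Create path if needed and prove that a file can be written there.

    Raises PermissionError (or another OSError) right away if it cannot.
    """
    # Same call that failed in the traceback, made before any training work.
    os.makedirs(path, exist_ok=True)
    # Write a probe file; it is deleted automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=path, prefix=".write_probe_"):
        pass
    return path
```

One possible use is to call assert_writable_dir("/nfs/xxx/workspace/af2/output/checkpoints") just before trainer.fit(...), on the rank that writes checkpoints (rank 0 in the traceback above).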
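Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, include a link to the original source and this notice.
Article link: https://blog.csdn.net/caroline_wendy/article/details/129830379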
