Co-authored-by: multica-agent <github@multica.ai>
# Troubleshooting Handbook

## Metadata

| Attribute | Value |
|-----------|-------|
| **Domain** | Operations |
| **Owner** | 严维序 (opengineer) |
| **Version** | v1.0 |
| **Created** | 2026-06-24 |
| **Last updated** | 2026-06-24 |
| **Tags** | troubleshooting, operations, fault diagnosis |

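## Overview

This handbook collects diagnosis methods and fixes for common system and service failures in the BizWings environment. It covers the core scenarios: SSH connections, Nginx, databases, disk space, and Docker.

---
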
## 1. SSH Connection Failures

### 1.1 Connection Timeout

```bash
# Diagnostic steps
ssh -vvv root@<ip> -p <port>    # show the verbose connection log
ping <ip>                        # check network reachability
nmap <ip> -p <port>              # check the port state
```

**Common causes**:
- The target server's firewall has not opened the port
- The source IP is not on the allowlist
- Server load is too high, so sshd responds slowly

**Fixes**:
1. Check the server firewall: `iptables -L -n` or `ufw status`
2. Check whether sshd is running: `systemctl status sshd`
3. Check the load: `top -bn1 | head -5`

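When `nmap` is not installed on the machine you are connecting from, the port check above can be approximated with bash alone. The following is a minimal sketch rather than part of the original runbook: `check_port` is a hypothetical helper name, and it assumes a bash built with `/dev/tcp` support plus coreutils `timeout`.

```bash
# Hypothetical helper: succeeds if a TCP connection to host:port opens
# within the timeout (default 3 seconds).
check_port() {
    local host="$1" port="$2" timeout_s="${3:-3}"
    # bash's /dev/tcp pseudo-device opens a TCP socket; timeout(1) bounds the wait.
    timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example: `check_port <ip> <port> && echo open || echo "closed or filtered"`. An immediate failure usually means the connection was refused, while exit status 124 means `timeout` gave up, which may point to a firewall silently dropping packets.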
### 1.2 Authentication Failure

```bash
# Diagnostic steps
ssh -p <port> root@<ip>          # try a password login
# Look for the message: Permission denied (publickey,password)
```

**Common causes**:
- Wrong password (check the record in TOOLS.md)
- SSH key authentication is misconfigured
- `PasswordAuthentication no` is set in `/etc/ssh/sshd_config`

**Fixes**:
1. Confirm that the password matches TOOLS.md
2. Check `sshd_config`: `grep PasswordAuthentication /etc/ssh/sshd_config`
3. Temporarily allow password login: `sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config && systemctl reload sshd`

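The one-line `sed` in step 3 edits the live configuration without a backup and only matches the exact text `PasswordAuthentication no`. A slightly more defensive variant is sketched below. `enable_password_auth` is a hypothetical helper name, and GNU `sed` is assumed. A reasonable sequence is `enable_password_auth && sshd -t && systemctl reload sshd`, so that the reload only happens if `sshd -t` accepts the file. Drop-in files or `Match` blocks may still override the setting, so treat this as a starting point.

```bash
# Hypothetical helper: set PasswordAuthentication to "yes" in the given
# config file, keeping a timestamped backup next to it. Assumes GNU sed.
enable_password_auth() {
    local cfg="${1:-/etc/ssh/sshd_config}"
    cp -p "$cfg" "$cfg.bak.$(date +%Y%m%d%H%M%S)" || return 1
    # Match the directive whether or not it is commented out, whatever its value.
    sed -i -E 's/^[#[:space:]]*PasswordAuthentication[[:space:]]+.*/PasswordAuthentication yes/' "$cfg"
}
```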

---

## 2. Nginx Service Problems

### 2.1 Nginx Fails to Start / Stuck in activating

```bash
# Diagnostic steps
systemctl status nginx                    # check the status
journalctl -u nginx --no-pager -n 50      # view the logs
nginx -t                                  # check the configuration syntax
```

**Root cause (from experience)**: leftover processes are still holding the port

```bash
# Fix
pkill -9 nginx                # force-kill leftover processes
sleep 2
systemctl start nginx         # start again
systemctl status nginx        # confirm the status
```

### 2.2 502 Bad Gateway

```bash
# Diagnostic steps
curl -I http://localhost:<upstream-port>   # check the upstream service
ss -tlnp | grep <upstream-port>            # check that the port is listening
systemctl status <upstream-service>        # check the upstream process
```

**Common causes**:
- The upstream service is not running or has crashed
- The connection pool is exhausted

**Fixes**:
1. Restart the upstream service: `systemctl restart <service>`
2. Check that the `upstream` configuration is correct

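To check the `upstream` configuration, it helps to see which backends Nginx actually proxies to. The sketch below is an illustration under assumptions: `list_proxy_targets` is a hypothetical helper name, and it only recognizes `proxy_pass` directives written on a single line.

```bash
# Hypothetical helper: read Nginx configuration text on stdin and print
# each distinct proxy_pass target, one per line.
list_proxy_targets() {
    grep -vE '^[[:space:]]*#' \
        | grep -oE 'proxy_pass[[:space:]]+[^;]+' \
        | awk '{ print $2 }' \
        | sort -u
}
```

For example, `nginx -T 2>/dev/null | list_proxy_targets` prints the targets from the full effective configuration. Each one can then be probed with the `curl -I` check above.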
### 2.3 Log Rotation Failure

```bash
# Diagnostic steps
head /var/log/nginx/error.log                    # look for log-write failures
ls -la /var/log/nginx/                           # inspect the log files
/usr/sbin/logrotate -d /etc/logrotate.d/nginx    # dry-run logrotate
```

**Fix**:

```bash
# Edit the postrotate script in /etc/logrotate.d/nginx
# Replace `invoke-rc.d nginx rotate` with:
postrotate
    systemctl reload nginx
endscript
```

---

## 3. Database Connection Failures

### 3.1 MySQL Connection Failure

```bash
# Diagnostic steps
mysql -h <host> -P <port> -u root -p    # test the connection
telnet <host> <port>                     # check the port
systemctl status mysql                   # check the service
```

**Common causes**:
- The service is not running
- The firewall does not allow port 3306
- User privilege or host restrictions
- The connection limit has been exceeded

**Fixes**:

```bash
# Check the connection count
mysql -e "SHOW VARIABLES LIKE 'max_connections';"
mysql -e "SHOW PROCESSLIST;"

# Check user privileges
mysql -e "SELECT user, host FROM mysql.user WHERE user='root';"
```

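The two connection queries above are easier to act on as a single percentage. The following is a minimal sketch: `conn_usage_pct` is a hypothetical helper name, and the commented `mysql` invocations assume the client can already log in without a prompt.

```bash
# Hypothetical helper: integer percentage of max_connections in use.
conn_usage_pct() {
    local current="$1" max="$2"
    [ "$max" -gt 0 ] 2>/dev/null || { echo "max must be a positive integer" >&2; return 1; }
    echo "$(( current * 100 / max ))"
}

# Possible way to obtain the inputs (assumes passwordless client access):
#   current=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'" | awk '{ print $2 }')
#   max=$(mysql -N -B -e "SELECT @@max_connections")
```

For example, `conn_usage_pct 120 151` prints `79`. A value that stays close to 100 suggests the connection limit is the bottleneck, not the network or privileges.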
### 3.2 MySQL Out of Disk Space

```bash
# Diagnosis
df -h    # disk space
mysql -e "SELECT table_schema, ROUND(SUM(data_length+index_length)/1024/1024,2) AS size_mb FROM information_schema.tables GROUP BY table_schema ORDER BY size_mb DESC;"
```

**Fixes**:
- Purge expired binlogs: `PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY);`
- Clean up temporary tables
- Expand the disk

---

## 4. Disk Space Alerts

### 4.1 Diagnosis

```bash
df -h                                                        # usage per partition
du -sh /* 2>/dev/null | sort -rh | head -10                  # largest top-level directories
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null    # locate large files
```

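To turn the `df` output into a short list of partitions that need attention, the usage column can be filtered against a threshold. This is a sketch under assumptions: `partitions_over` is a hypothetical helper name, it expects the POSIX layout of `df -P`, and it will misreport mount points that contain spaces.

```bash
# Hypothetical helper: read `df -P` output on stdin and print the mount
# point and usage of every filesystem at or above the threshold (default 80).
partitions_over() {
    local threshold="${1:-80}"
    awk -v t="$threshold" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)                     # "85%" -> "85"
        if (pct + 0 >= t + 0) print $6, $5
    }'
}
```

For example: `df -P | partitions_over 80`.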
### 4.2 Cleanup

```bash
# Docker cleanup
# Caution: with -a and --volumes this also removes all unused images and
# unused volumes, including any data stored in them.
docker system prune -af --volumes    # remove unused Docker resources

# System journal cleanup
journalctl --vacuum-time=7d          # remove journal entries older than 7 days

# Application log archiving
find /var/log -name "*.log" -mtime +30 -exec gzip {} \;    # compress old logs
find /var/log -name "*.gz" -mtime +90 -delete              # delete compressed logs older than 90 days
```

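Because the `find ... -exec gzip` and `find ... -delete` commands act immediately, it can be worth previewing what they would touch. The following is a minimal dry-run sketch: `list_stale_logs` is a hypothetical helper name, and the defaults simply mirror the 30-day rule above. Running `list_stale_logs /var/log 30` prints the candidates without changing anything, so files that a service still holds open can be excluded before compressing.

```bash
# Hypothetical helper: list *.log files under a directory that were last
# modified more than N days ago, without modifying them.
list_stale_logs() {
    local dir="${1:-/var/log}" days="${2:-30}"
    find "$dir" -type f -name '*.log' -mtime +"$days" -print
}
```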

---

## 5. Docker Container Problems

### 5.1 Container Stopped

```bash
docker ps -a | grep <container>      # check the container status
docker logs <container> --tail 50    # view recent logs
```

**Fix**:

```bash
docker start <container>             # start it manually
docker compose -f <path> up -d       # bring it back up with Compose
```

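Before restarting a stopped container, its exit code often hints at why it stopped. The sketch below maps common codes to a likely cause. `explain_exit_code` is a hypothetical helper name, and the mapping relies on the general convention that codes above 128 mean "128 plus signal number", so treat the output as a hint and not a diagnosis.

```bash
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
    local code="$1"
    case "$code" in
        0)   echo "exited normally" ;;
        125) echo "docker run itself failed" ;;
        126) echo "container command could not be invoked" ;;
        127) echo "container command not found" ;;
        137) echo "killed by SIGKILL (check OOMKilled, or docker kill)" ;;
        143) echo "terminated by SIGTERM (for example docker stop)" ;;
        *)   if [ "$code" -gt 128 ] 2>/dev/null; then
                 echo "killed by signal $(( code - 128 ))"
             else
                 echo "application error (exit code $code)"
             fi ;;
    esac
}
```

The code and the OOM flag can be read with `docker inspect -f '{{.State.ExitCode}} {{.State.OOMKilled}}' <container>`, and the first value passed to `explain_exit_code`.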
### 5.2 Docker API Unresponsive

```bash
systemctl status docker                   # check the Docker service
journalctl -u docker --no-pager -n 50     # view the Docker logs
```

**Fix**:

```bash
systemctl restart docker                  # restart the Docker daemon
```

---

## 6. System Process Failures

### 6.1 Port Already in Use

```bash
ss -tlnp | grep <port>    # find the process using the port
fuser -k <port>/tcp       # kill the process holding the port
```

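A plain `grep <port>` also matches longer port numbers (searching for 80 also hits 8080) and PIDs that happen to contain the digits. The sketch below matches the local port exactly and prints only the owning PIDs, so they can be reviewed before anything is killed. `pids_on_port` is a hypothetical helper name, and it assumes the column layout printed by `ss -tlnp`.

```bash
# Hypothetical helper: read `ss -tlnp` output on stdin and print the PIDs
# of processes listening on exactly the given port.
pids_on_port() {
    local port="$1"
    # Column 4 is "Local Address:Port"; anchor the match at the end of it.
    awk -v p=":$port" '$4 ~ (p "$")' \
        | grep -oE 'pid=[0-9]+' \
        | cut -d= -f2 \
        | sort -un
}
```

For example: `ss -tlnp | pids_on_port 80`. Inspecting each PID with `ps -fp <pid>` before running `fuser -k` reduces the risk of killing an unrelated service.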
### 6.2 systemd Service Problems

```bash
systemctl status <service>                     # check the status
journalctl -u <service> --no-pager -n 100      # view the service logs

# Common fixes
systemctl daemon-reload                        # reload unit files
systemctl restart <service>                    # restart
systemctl enable <service>                     # enable start on boot
```

---

## 7. Log Analysis Tools

### 7.1 Common Commands

```bash
# Follow a log in real time
tail -f /var/log/<app>/access.log

# Filter for errors
grep -i "error\|exception\|failed" /var/log/<app>/app.log | tail -50

# Filter by time range
awk '/2026-06-24 10:00/,/2026-06-24 11:00/' /var/log/<app>/app.log
```

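The `awk` range pattern above depends on lines containing the exact start and end minute. If no line matches the start nothing is printed, and if no line matches the end the output runs to the end of the file. A comparison-based filter avoids both cases. The sketch below assumes every log line begins with a `YYYY-MM-DD HH:MM` timestamp, and `filter_time_range` is a hypothetical helper name.

```bash
# Hypothetical helper: print stdin lines whose leading timestamp falls in
# [start, end). Assumes each line starts with "YYYY-MM-DD HH:MM", which
# sorts correctly as plain text.
filter_time_range() {
    local start="$1" end="$2"
    awk -v s="$start" -v e="$end" '
        { ts = substr($0, 1, length(s)) }   # leading timestamp, same width as start
        ts >= s && ts < e                   # string comparison, default action prints
    '
}
```

For example: `filter_time_range "2026-06-24 10:00" "2026-06-24 11:00" < /var/log/<app>/app.log`.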
### 7.2 Key Checkpoints

| Symptom | Check first | Common root cause |
|---------|-------------|-------------------|
| Service unresponsive | systemctl status | Process OOM / crash |
| API returns errors | Application log + Nginx log | Code bug / upstream dependency failure |
| High latency | top + ss + application log | Resource contention / deadlock |
| Database errors | MySQL error log | Slow queries / connection limit exceeded |

---

## Related Entries

- [部署流程_v1.0.md](部署流程_v1.0.md) (deployment process)
- [服务器运维标准_v1.0.md](服务器运维标准_v1.0.md) (server operations standard)

## Change Log

| Date | Version | Change | Author |
|------|---------|--------|--------|
| 2026-06-24 | v1.0 | Initial version | 严维序 |