# docs

**Repository Path**: yangyongguang/docs

## Basic Information

- **Project Name**: docs
- **Description**: To build and enrich documentation for openEuler project.
- **Primary Language**: Unknown
- **License**: CC-BY-SA-4.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 662
- **Created**: 2024-09-23
- **Last Updated**: 2025-02-13

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

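# gala-anteater User Guide

gala-anteater is an AI-based platform for detecting operating system anomalies. It provides time-series data preprocessing, anomaly detection, and anomaly reporting. By combining offline pre-training with online incremental learning and model updates, it adapts well to fault diagnosis on multi-dimensional, multi-modal data.

This document describes how to deploy and use the gala-anteater service.

## Installation

Configure the repo source: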

```ini
[everything]
name=everything
baseurl=http://121.36.84.172/dailybuild/EBS-openEuler-24.09/EBS-openEuler-24.09/everything/$basearch/
enabled=1
gpgcheck=0
priority=1

[EPOL]
name=EPOL
baseurl=http://repo.openeuler.org/openEuler-22.03-LTS-SP4/EPOL/main/$basearch/
enabled=1
gpgcheck=0
priority=1
```

Install gala-anteater:

```bash
# Run as root.
yum install gala-anteater
```

## Configuration

> Note: gala-anteater does not require any additional config file. Its parameters are passed as command-line startup arguments.

### Startup parameters

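| Short option | Long option | Type | Required | Default | Name | Description |
|---|---|---|---|---|---|---|
| -ks | --kafka_server | string | Yes |  | KAFKA_SERVER | IP address of the Kafka server, for example: localhost / xxx.xxx.xxx.xxx |
| -kp | --kafka_port | string | Yes |  | KAFKA_PORT | Port of the Kafka server, for example: 9092 |
| -ps | --prometheus_server | string | Yes |  | PROMETHEUS_SERVER | IP address of the Prometheus server, for example: localhost / xxx.xxx.xxx.xxx |
| -pp | --prometheus_port | string | Yes |  | PROMETHEUS_PORT | Port of the Prometheus server, for example: 9090 |
| -m | --model | string | No | vae | MODEL | Anomaly detection model. Two models are currently supported: random_forest and vae.<br>random_forest: random forest model; does not support online learning.<br>vae: Variational Autoencoder, an unsupervised model; supports updating the model with historical data at first startup. |
| -d | --duration | int | No | 1 | DURATION | How often the anomaly detection model runs, in minutes: detection is performed once every x minutes. |
| -r | --retrain | bool | No | False | RETRAIN | Whether to update the model with historical data at startup. Currently supported only by the vae model. |
| -l | --look_back | int | No | 4 | LOOK_BACK | Number of past days of historical data used to update the model. |
| -t | --threshold | float | No | 0.8 | THRESHOLD | Threshold of the anomaly detection model, in the range (0, 1). Larger values reduce the false positive rate; 0.5 or higher is recommended. |
| -sli | --sli_time | int | No | 400 | SLI_TIME | Application performance indicator (SLI), in milliseconds. Larger values reduce the false positive rate; 200 or higher is recommended.<br>For scenarios with a high false positive rate, 1000 or higher is recommended. |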
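
To make the table concrete, the following is a minimal sketch of how options with these names, types, and defaults could be declared with Python's `argparse`. It is not taken from the gala-anteater source; the `str2bool` helper and `build_parser` function are hypothetical names introduced for illustration. The helper exists because passing `type=bool` to `argparse` would turn the string `"False"` into `True`.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'True'/'False' style strings (hypothetical helper, not from gala-anteater)."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Illustrative only: option names, types, and defaults are copied from the table.
    parser = argparse.ArgumentParser(prog="gala-anteater")
    parser.add_argument("-ks", "--kafka_server", type=str, required=True)
    parser.add_argument("-kp", "--kafka_port", type=str, required=True)
    parser.add_argument("-ps", "--prometheus_server", type=str, required=True)
    parser.add_argument("-pp", "--prometheus_port", type=str, required=True)
    parser.add_argument("-m", "--model", type=str, default="vae",
                        choices=["random_forest", "vae"])
    parser.add_argument("-d", "--duration", type=int, default=1)
    parser.add_argument("-r", "--retrain", type=str2bool, default=False)
    parser.add_argument("-l", "--look_back", type=int, default=4)
    parser.add_argument("-t", "--threshold", type=float, default=0.8)
    parser.add_argument("-sli", "--sli_time", type=int, default=400)
    return parser


# Parse a command line similar to the ones used in this guide.
args = build_parser().parse_args(
    ["-ks", "localhost", "-kp", "9092", "-ps", "localhost", "-pp", "9090",
     "-r", "True", "-l", "7"]
)
print(args.model, args.retrain, args.look_back, args.threshold, args.sli_time)
```

Options that are omitted on the command line, such as `-m`, `-t`, and `-sli` here, fall back to the defaults listed in the table.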

## Startup

Run the following command to start gala-anteater:

```bash
systemctl start gala-anteater
```

### Running with online training (recommended)

```bash
gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -r True -l 7 -t 0.6 -sli 400
```

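### Running in normal mode

Start the service with `systemctl start gala-anteater`.

Expected result: `systemctl status gala-anteater` shows the service in the `running` state.

The corresponding command line, without the retraining parameters `-r` and `-l`, is: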

```bash
gala-anteater -ks {ip} -kp {port} -ps {ip} -pp {port} -m vae -t 0.6 -sli 400
```
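
The two invocations differ only in the retraining options. As a rough sketch, a wrapper script could assemble either command line as follows; `build_command` is a hypothetical helper, and the threshold check simply mirrors the (0, 1) range given in the parameter table.

```python
import shlex


def build_command(kafka, prometheus, retrain=False, look_back=4,
                  threshold=0.6, sli_time=400):
    """Return the gala-anteater argv list; `kafka` and `prometheus` are (ip, port) pairs."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie in the open interval (0, 1)")
    argv = [
        "gala-anteater",
        "-ks", kafka[0], "-kp", str(kafka[1]),
        "-ps", prometheus[0], "-pp", str(prometheus[1]),
        "-m", "vae",
    ]
    if retrain:
        # Online training adds -r True plus the look-back window in days.
        argv += ["-r", "True", "-l", str(look_back)]
    argv += ["-t", str(threshold), "-sli", str(sli_time)]
    return argv


# Placeholder addresses; replace them with the real Kafka and Prometheus endpoints.
normal = build_command(("localhost", 9092), ("localhost", 9090))
online = build_command(("localhost", 9092), ("localhost", 9090),
                       retrain=True, look_back=7)
print(shlex.join(normal))
print(shlex.join(online))
```

The resulting list can be passed to `subprocess.run` on a host where gala-anteater is installed.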

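### Fault injection

gala-anteater is a fault detection and root cause localization module. During testing, faults have to be constructed through fault injection so that the module can report the faulty node and the root cause node of the fault propagation.

* Fault injection (for reference only)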
    ```bash
    chaosblade create disk burn --size 10 --read --write --path /var/lib/docker/overlay2/cf0a469be8a84cabe1d057216505f8d64735e9c63159e170743353a208f6c268/merged --timeout 120
    
    ```
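    chaosblade is a fault injection tool that can simulate a variety of faults, including but not limited to disk, network, and I/O faults.

    Note: When different faults are injected, the metric collector (for example, gala-gopher) monitors the related metrics and reports them to Prometheus. Some of the related metrics then show obvious fluctuations in the Prometheus graph view.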
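
Besides the graph view, the fluctuation can be inspected through the Prometheus HTTP API. The sketch below builds a `query_range` request for the last ten minutes. The host, port, and the choice of `gala_gopher_tcp_link_notack_bytes` (borrowed from the output example in this guide) are placeholders; pick a metric related to the fault that was actually injected. `build_query_range_url` and `fetch_series` are hypothetical helper names.

```python
import json
import time
import urllib.parse
import urllib.request


def build_query_range_url(server, port, metric, minutes=10, step="15s", now=None):
    """Build a Prometheus range-query URL covering the last `minutes` minutes."""
    end = int(time.time()) if now is None else int(now)
    start = end - minutes * 60
    params = urllib.parse.urlencode(
        {"query": metric, "start": start, "end": end, "step": step}
    )
    return f"http://{server}:{port}/api/v1/query_range?{params}"


def fetch_series(url, timeout=10):
    """Return the list of series from the response; needs a reachable Prometheus server."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["data"]["result"]


# Only the URL is built here; call fetch_series(url) against a live server.
url = build_query_range_url(
    "localhost", 9090, "gala_gopher_tcp_link_notack_bytes", now=1659075600
)
print(url)
```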

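### Checking the gala-anteater service status

If the log shows the following content, the service has started successfully. The startup log is also saved to `logs/anteater.log` under the current working directory.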

```log
2022-09-01 17:52:54,435 - root - INFO - Run gala_anteater main function...
2022-09-01 17:52:54,436 - root - INFO - Start to try updating global configurations by querying data from Kafka!
2022-09-01 17:52:54,994 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,997 - root - INFO - Loads metric and operators from file: xxx\metrics.csv
2022-09-01 17:52:54,998 - root - INFO - Start to re-train the model based on last day metrics dataset!
2022-09-01 17:52:54,998 - root - INFO - Get training data during 2022-08-31 17:52:00+08:00 to 2022-09-01 17:52:00+08:00!
2022-09-01 17:53:06,994 - root - INFO - Spends: 11.995422840118408 seconds to get unique machine_ids!
2022-09-01 17:53:06,995 - root - INFO - The number of unique machine ids is: 1!                            
2022-09-01 17:53:06,996 - root - INFO - Fetch metric values from machine: xxxx.
2022-09-01 17:53:38,385 - root - INFO - Spends: 31.3896164894104 seconds to get get all metric values!
2022-09-01 17:53:38,392 - root - INFO - The shape of training data: (17281, 136)
2022-09-01 17:53:38,444 - root - INFO - Start to execute vae model training...
2022-09-01 17:53:38,456 - root - INFO - Using cpu device
2022-09-01 17:53:38,658 - root - INFO - Epoch(s): 0     train Loss: 136.68      validate Loss: 117.00
2022-09-01 17:53:38,852 - root - INFO - Epoch(s): 1     train Loss: 113.73      validate Loss: 110.05
2022-09-01 17:53:39,044 - root - INFO - Epoch(s): 2     train Loss: 110.60      validate Loss: 108.76
2022-09-01 17:53:39,235 - root - INFO - Epoch(s): 3     train Loss: 109.39      validate Loss: 106.93
2022-09-01 17:53:39,419 - root - INFO - Epoch(s): 4     train Loss: 106.48      validate Loss: 103.37
...
2022-09-01 17:53:57,744 - root - INFO - Epoch(s): 98    train Loss: 97.63       validate Loss: 96.76
2022-09-01 17:53:57,945 - root - INFO - Epoch(s): 99    train Loss: 97.75       validate Loss: 96.58
2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).
2022-09-01 17:53:57,973 - apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Added job "partial" to job store "default"
2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Looking for jobs to run
2022-09-01 17:53:57,975 - apscheduler.scheduler - DEBUG - Next wakeup is due at 2022-09-01 17:54:57.973533+08:00 (in 59.998006 seconds)
```
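
This check can be scripted. The sketch below scans the log for the `Scheduler started` message; treating that line as the success marker is an assumption based only on the sample log above, and `service_started` and `check_log_file` are hypothetical helper names.

```python
from pathlib import Path

# Assumption: this message, taken from the sample log, marks a successful start.
STARTED_MARKER = "Scheduler started"


def service_started(log_lines):
    """Return True if any log line contains the scheduler start-up message."""
    return any(STARTED_MARKER in line for line in log_lines)


def check_log_file(path="logs/anteater.log"):
    """Check the log file, relative to the current directory, for the marker."""
    log_path = Path(path)
    if not log_path.exists():
        return False
    with log_path.open(encoding="utf-8", errors="replace") as handle:
        return service_started(handle)


# Two lines copied from the sample log.
sample = [
    "2022-09-01 17:53:57,969 - root - INFO - Schedule recurrent job with time interval 1 minute(s).",
    "2022-09-01 17:53:57,974 - apscheduler.scheduler - INFO - Scheduler started",
]
print(service_started(sample))
```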

## Output data

When gala-anteater detects an anomaly, it writes the result to Kafka. The output data format is as follows:

```json
{
   "Timestamp":1659075600000,
   "Attributes":{
      "entity_id":"xxxxxx_sli_1513_18",
      "event_id":"1659075600000_1fd37742xxxx_sli_1513_18",
      "event_type":"app"
   },
   "Resource":{
      "anomaly_score":1.0,
      "anomaly_count":13,
      "total_count":13,
      "duration":60,
      "anomaly_ratio":1.0,
      "metric_label":{
         "machine_id":"1fd37742xxxx",
         "tgid":"1513",
         "conn_fd":"18"
      },
      "recommend_metrics":{
         "gala_gopher_tcp_link_notack_bytes":{
            "label":{
               "__name__":"gala_gopher_tcp_link_notack_bytes",
               "client_ip":"x.x.x.165",
               "client_port":"51352",
               "hostname":"localhost.localdomain",
               "instance":"x.x.x.172:8888",
               "job":"prometheus-x.x.x.172",
               "machine_id":"xxxxxx",
               "protocol":"2",
               "role":"0",
               "server_ip":"x.x.x.172",
               "server_port":"8888",
               "tgid":"3381701"
            },
            "score":0.24421279500639545
         },
         ...
      },
      "metrics":"gala_gopher_ksliprobe_recent_rtt_nsec"
   },
   "SeverityText":"WARN",
   "SeverityNumber":14,
   "Body":"TimeStamp, WARN, APP may be impacting sli performance issues."
}
```
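
A consumer of these events typically needs the entity, the triggering metric, and the recommended metrics ranked by score. The following sketch parses a trimmed copy of the event above with the standard `json` module. `top_recommended_metrics` and `summarize` are hypothetical helper names, and reading the messages from Kafka is left out because the topic name is not given in this guide.

```python
import json


def top_recommended_metrics(event, limit=3):
    """Return (metric_name, score) pairs from Resource.recommend_metrics, highest first."""
    recommended = event.get("Resource", {}).get("recommend_metrics", {})
    scored = [(name, info.get("score", 0.0)) for name, info in recommended.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


def summarize(event):
    """Build a one-line summary from fields shown in the example event."""
    resource = event["Resource"]
    attrs = event["Attributes"]
    return (
        f"{event['SeverityText']} {attrs['event_type']} {attrs['entity_id']}: "
        f"{resource['anomaly_count']}/{resource['total_count']} anomalous points "
        f"on {resource['metrics']}"
    )


# Trimmed copy of the example event; label details are omitted for brevity.
raw = """
{
  "Timestamp": 1659075600000,
  "Attributes": {"entity_id": "xxxxxx_sli_1513_18", "event_type": "app"},
  "Resource": {
    "anomaly_count": 13,
    "total_count": 13,
    "recommend_metrics": {
      "gala_gopher_tcp_link_notack_bytes": {"label": {}, "score": 0.24421279500639545}
    },
    "metrics": "gala_gopher_ksliprobe_recent_rtt_nsec"
  },
  "SeverityText": "WARN"
}
"""
event = json.loads(raw)
print(summarize(event))
print(top_recommended_metrics(event))
```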