node exporter 作用: 是收集操作系统的基本系统, 例如cpu, 内存, 硬盘空间等基本信息, 并对外提供api接口用于prometheus查询存储;
1)docker方式运行node exporter
docker run -d --name node-exporter -p 9100:9100 -v "/proc:/host/proc:ro" -v "/sys:/host/sys:ro" -v "/:/rootfs:ro" --restart=always --net="host" prom/node-exporter --path.procfs /host/proc --path.sysfs /host/sys --collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"
2)验证, 可以通过对外暴露的api接口获取数据
curl http://192.168.1.42:9100/metrics
#(2)安装consul
consul作用: 服务注册中心,向外提供服务的增删api接口, prometheus可以向consul动态获取节点信息以及自动加载配置
1)docker安装consul
docker run --restart=always --name consul -d -p 8500:8500 consul
2)向consul的api接口添加服务
curl -X PUT -d ‘{"id": "node03","name": "node03","address": "192.168.1.42","port": 9100,"tags": ["test"],"checks": [{"http": "http://192.168.1.42:9100/","interval": "5s"}]}‘ http://localhost:8500/v1/agent/service/register
扩展: 删除服务节点
curl -X PUT http://localhost:8500/v1/agent/service/deregister/node02
3)服务注册成功
#(3)安装和配置altermanger
altermanager作用: 接收prometheus发送的告警信息, 通过相关方式例如邮件和微信等方式发送给接收者;
0)准备目录
test -d /etc/alertmanager || mkdir -pv /etc/alertmanager
1)准备配置文件
# cat /etc/alertmanager/alertmanager.yml global: resolve_timeout: 5m templates: - ‘/etc/alertmanager/wechat.tmpl‘ route: group_by: [‘alertname‘] group_wait: 10s group_interval: 10s repeat_interval: 1h receiver: ‘wechat‘ receivers: - name: ‘wechat‘ wechat_configs: - corp_id: ‘wwc08fcb42fc6fe93c‘ to_party: ‘2‘ agent_id: ‘1000002‘ api_secret: ‘cLG91Xgcd3o3zPJp6NbOJV9m7SBIlhtCScxov3Hp-XQ‘ send_resolved: true
2)准备模板文件
# cat /etc/alertmanager/wechat.tmpl {{ define "wechat.default.message" }} {{ range .Alerts }} ========start========== 告警程序:prometheus_alert 告警级别:{{ .Labels.severity }} 告警类型:{{ .Labels.alertname }} 故障主机: {{ .Labels.instance }} 告警主题: {{ .Annotations.summary }} 告警详情: {{ .Annotations.description }} 触发时间: {{ .StartsAt.Format "2006-01-02 15:04:05" }} ========end========== {{ end }} {{ end }}
3)启动容器
docker run --restart=always -d -p 9093:9093 -v /etc/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml -v /etc/alertmanager/wechat.tmpl:/etc/alertmanager/wechat.tmpl --name alertmanager prom/alertmanager
4)验证容器是否有报错
docker logs -f alertmanager
#(4)安装和配置prometheus
prometheus作用: 用于向exporter获取数据并保存数据, 同时可以设置规则和触发器, 向报警器发送信息;
1)准备目录
test -d /etc/prometheus || mkdir /etc/prometheus -pv
2)准备prometheus配置文件
rule_files : 报警规则文件
alerting: 当触发报警, 把报警相关发送给altermanager, 由altermanager接收告警信息在发送给接收人;
job_name: consul : prometheus 向consul注册;
# cat /etc/prometheus/prometheus.yml global: scrape_interval: 15s evaluation_interval: 15s rule_files: - "/etc/prometheus/*.rules" alerting: alertmanagers: - static_configs: - targets: - "192.168.1.82:9093" scrape_configs: - job_name: prometheus static_configs: - targets: [‘localhost:9090‘] labels: instance: prometheus - job_name: ‘consul‘ consul_sd_configs: - server: ‘192.168.1.82:8500‘ services: [] relabel_configs: - source_labels: [__meta_consul_tags] regex: .*test.* action: keep
3)准备告警规则文件 , 注意该文件不能有tag键, 同时key和value之间必须要有空格
# cat /etc/prometheus/prometheus.rules groups: - name: alert-rule rules: - alert: NodeFilesystemUsage-high expr: (1- (node_filesystem_free_bytes{fstype=~"ext3|ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext3|ext4|xfs"}) ) * 100 > 80 for: 2m labels: severity: warning annotations: summary: "{{$labels.instance}}: High Node Filesystem usage detected" description: "{{$labels.instance}}: Node Filesystem usage is above 80% ,(current value is: {{ $value }})" - alert: NodeMemoryUsage expr: (100 - (((node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes)/node_memory_MemTotal_bytes) * 100)) > 80 for: 2m labels: severity: warning annotations: summary: "{{$labels.instance}}: High Node Memory usage detected" description: "{{$labels.instance}}: Node Memory usage is above 80% ,(current value is: {{ $value }})" - alert: NodeCPUUsage expr: (100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80 for: 2m labels: severity: warning annotations: summary: "{{$labels.instance}}: Node High CPU usage detected" description: "{{$labels.instance}}: Node CPU usage is above 80% ,(current value is: {{ $value }})"
4)docker方式启动prometheus
docker run --restart=always --name prometheus -d -p 9090:9090 -v /etc/prometheus:/etc/prometheus prom/prometheus
5)登录到prometheus验证
rule这里能看到相关规则
#(4)下载安装和配置grafana
1)下载和启动grafana
wget https://dl.grafana.com/oss/release/grafana-6.0.2-1.x86_64.rpm yum install grafana-6.0.2-1.x86_64.rpm -y systemctl start grafana-server systemctl enable grafana-server ss -anltup |grep 3000
2)添加图形
https://grafana.com/dashboards 页面搜索node exporter 根据id导入模板 id 为8919
3)查看图形
9)安装饼图插件
grafana-cli plugins install grafana-piechart-panel systemctl restart grafana-server