What is Prometheus?

Prometheus is an open-source monitoring and alerting system and time-series database (TSDB) originally developed at SoundCloud. It is written in Go and is an open-source counterpart of Google's BorgMon monitoring system.
In 2016 Prometheus joined the Cloud Native Computing Foundation (CNCF) under the Linux Foundation, initiated with Google's backing, becoming its second hosted open-source project (after Kubernetes).
Prometheus remains very active in the open-source community.
Compared with Heapster (a Kubernetes subproject for collecting cluster performance data), Prometheus is more complete and comprehensive, and its performance is sufficient for clusters of tens of thousands of machines.

Features of Prometheus

  • A multi-dimensional data model.
  • A flexible query language.
  • No reliance on distributed storage; single server nodes are autonomous.
  • Time-series data collected via a pull model over HTTP.
  • Time-series data can also be pushed through an intermediary gateway.
  • Targets discovered via service discovery or static configuration.
  • Support for many kinds of graphs and dashboards, e.g. Grafana.

Official site: https://prometheus.io/

Architecture

Basic Principle

Prometheus works by periodically scraping the state of monitored components over HTTP; any component can be monitored simply by exposing a suitable HTTP endpoint, with no SDK or other integration required. This makes it a good fit for monitoring virtualized environments such as VMs, Docker, and Kubernetes. The HTTP endpoint that exposes a component's metrics is called an exporter. Exporters already exist for most components commonly used at internet companies, e.g. Varnish, HAProxy, Nginx, MySQL, and Linux system information (disk, memory, CPU, network, and so on).
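To make the exporter idea concrete, here is a minimal sketch of such an HTTP endpoint in Python (the metric name, value, and port are invented for illustration; real exporters are normally built with the official client libraries):

```python
# Minimal exporter sketch: serve hand-rolled metrics in the Prometheus
# text exposition format on /metrics. Metric name and value are made up.
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics():
    """Render one counter in the text exposition format."""
    lines = [
        "# HELP app_requests_total Total requests handled.",
        "# TYPE app_requests_total counter",
        "app_requests_total 42",
    ]
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To actually serve (this call blocks):
# HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Point a Prometheus scrape job at the machine and port (here 9100) and this output is collected on every scrape interval.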

Service Workflow

  • The Prometheus daemon periodically scrapes metrics from its targets; each target must expose an HTTP endpoint for it to scrape. Targets can be specified via configuration files, text files, Zookeeper, Consul, DNS SRV lookup, and more. Prometheus monitors with a PULL model: the server either pulls data directly from targets or pulls it indirectly from an intermediary gateway that targets push to.
  • Prometheus stores all scraped data locally, cleans and aggregates it according to configured rules, and records the results into new time series.
  • Prometheus visualizes the collected data through PromQL and other APIs. Many charting options are supported, such as Grafana, the bundled Promdash, and its own template engine. Prometheus also provides an HTTP API for querying, so output can be customized as needed.
  • PushGateway lets clients push metrics to it, while Prometheus simply scrapes the gateway on its regular schedule.
  • Alertmanager is a component independent of Prometheus; it works with Prometheus alerting rules and provides very flexible notification options.
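As an example of the HTTP query API mentioned above, querying the built-in `up` metric looks roughly like this (timestamp and label values here are illustrative):

```
$ curl 'http://localhost:9090/api/v1/query?query=up'
{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"__name__":"up","job":"prometheus","instance":"localhost:9090"},
   "value":[1607000000.0,"1"]}]}}
```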

The Three Main Components

  • Server: collects and stores data, and provides the PromQL query language.
  • Alertmanager: the alert manager, responsible for sending notifications.
  • Push Gateway: an intermediary gateway that lets short-lived jobs push their metrics.

Installation

Prometheus
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.23.0/prometheus-2.23.0.linux-amd64.tar.gz

Extract and install

tar zxvf prometheus-2.23.0.linux-amd64.tar.gz
mv prometheus-2.23.0.linux-amd64 /opt/
vi /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=Prometheus Monitoring System

[Service]
ExecStart=/opt/prometheus-2.23.0.linux-amd64/prometheus \
--config.file=/opt/prometheus-2.23.0.linux-amd64/prometheus.yml \
--web.listen-address=:9090

[Install]
WantedBy=multi-user.target

Start

systemctl start prometheus
systemctl enable prometheus

Configuration File Explained

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

  • global: global Prometheus settings, such as the scrape interval and scrape timeout.
  • scrape_interval: per-job scrape interval; defaults to the global value.
  • scrape_timeout: per-job scrape timeout; defaults to the global value.
  • rule_files: alerting rule files; based on these rules, Prometheus pushes alerts to Alertmanager.
  • scrape_configs: scrape configuration; all of Prometheus's data collection is configured in this section.
  • alerting: alerting configuration; chiefly the Alertmanager instance addresses that Prometheus pushes alerts to.
  • metrics_path: the scrape path; defaults to /metrics.
  • scheme: the protocol used for scraping, http or https.
  • params: URL parameters.
  • basic_auth: authentication credentials.
  • *_sd_configs: service-discovery configuration.
  • static_configs: statically specified targets for a job.
  • relabel_configs: relabeling settings.
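To make these fields concrete, here is a hypothetical job that combines several of them (the job name, target address, and credentials are invented for illustration):

```yaml
scrape_configs:
  - job_name: 'example-app'        # hypothetical job
    scheme: https                  # scrape over https instead of the default http
    metrics_path: '/metrics'       # the default path, shown explicitly
    params:
      module: [http_2xx]           # extra URL parameters
    basic_auth:
      username: monitor
      password: secret
    scrape_interval: 30s           # overrides the global value for this job only
    static_configs:
    - targets: ['10.0.0.1:8443']
```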

static_configs example

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: "node"
    static_configs:
    - targets: ['192.168.50.57:9100','192.168.50.58:9100','192.168.50.59:9100']

file_sd_configs example

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - job_name: "node"
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - "/usr/local/prometheus/prometheus/conf/node*.yml"

# The standalone target file looks like this
cat conf/node-dis.conf
- targets: ['192.168.50.57:9100','192.168.50.58:9100','192.168.50.59:9100']

Or it can be configured like this:

[root@node00 conf]# cat node-dis.yml
- targets:
  - "192.168.100.10:20001"
  labels:
    hostname: node00
- targets:
  - "192.168.100.11:20001"
  labels:
    hostname: node01
- targets:
  - "192.168.100.12:20001"
  labels:
    hostname: node02

With file_sd_configs in place, we can edit the target file (node-dis.yml) without restarting Prometheus: within the configured refresh_interval, Prometheus reloads the new target information on its own.

relabel_configs example

Relabeling is a powerful tool that rewrites a target's label set before the target is scraped. Each scrape configuration may define multiple relabeling rules, which are applied to every target's label set in the order they are configured.

After a target has been relabeled, labels beginning with __ are removed from its label set.

relabel action types

  • replace: replace a label and its value.
  • keep: scrape only the instances that match the condition; drop the rest.
  • drop: do not scrape the instances that match the condition; scrape the rest.
  • labeldrop: delete the matching labels from scraped instances.
  • labelkeep: keep only the matching labels on scraped instances and delete all others.
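These semantics can be sketched in a few lines of Python purely as an illustration (this is not Prometheus's implementation; a target's label set is modeled as a dict, and returning None means the target is dropped):

```python
# Illustration of relabel action semantics; not Prometheus source code.
import re

def relabel(labels, action, source_labels=(), separator=";",
            regex="(.*)", target_label=None, replacement="$1"):
    value = separator.join(labels.get(l, "") for l in source_labels)
    pattern = re.compile("^(?:" + regex + ")$")  # Prometheus fully anchors the regex
    m = pattern.match(value)
    if action == "replace":
        if m and target_label:
            labels = dict(labels)
            # translate $1-style references into Python backreferences
            labels[target_label] = m.expand(replacement.replace("$", "\\"))
        return labels
    if action == "keep":
        return labels if m else None       # None means: target not scraped
    if action == "drop":
        return None if m else labels
    if action == "labeldrop":
        return {k: v for k, v in labels.items() if not pattern.match(k)}
    if action == "labelkeep":
        return {k: v for k, v in labels.items() if pattern.match(k)}
    raise ValueError("unknown action: " + action)
```

For example, replace with source_labels [__hostname__], regex (.*), and target_label nodename copies the hostname value into a new nodename label.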
replace

Original configuration

global:
  scrape_interval: 5s # scrape every 5 seconds
  evaluation_interval: 5s # evaluate rules every 5 seconds

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "/opt/prometheus-2.23.0.linux-amd64/rule.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - "/opt/prometheus-2.23.0.linux-amd64/conf/node*.yml"

vi conf/node-dis.yml
- targets: ['192.168.50.57:9100']
  labels:
    __hostname__: dev-database
    __region_id__: "cn-beijing"
    __availability_zone__: "a"
- targets: ['localhost:9100']
  labels:
    __hostname__: prometheus
    __region_id__: "cn-beijing"
    __availability_zone__: "b"

Looking at the target information now shows the figure below.

Configure relabeling to rewrite the __hostname__ label as nodename:

global:
  scrape_interval: 5s # scrape every 5 seconds
  evaluation_interval: 5s # evaluate rules every 5 seconds

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "/opt/prometheus-2.23.0.linux-amd64/rule.yml"

scrape_configs:
  - job_name: 'bounter-monitor'
    scrape_interval: 5s
    metrics_path: '/actuator/prometheus'
    static_configs:
    - targets: ['192.168.10.228:8080']
      labels:
        __hostname__: springboot
    relabel_configs:
    - source_labels:
      - "__hostname__"
      regex: "(.*)"
      target_label: "nodename"
      action: replace
      replacement: "$1"

Restart the service and check the target information, as shown below:

source_labels specifies the source labels to process; target_label specifies the new label name after the replace; action selects the relabel action, here replace. regex is matched against the value of the source label (__hostname__), and "(.*)" matches any value. replacement gives the value written into target_label, with the captured group referenced as $1.

Changing the rule to regex: "(dev-database)" gives the result shown below.

Our base labels include __region_id__ and __availability_zone__; to merge the two fields into a single label, we can again use replace.

Modify the configuration as follows:

global:
  scrape_interval: 5s # scrape every 5 seconds
  evaluation_interval: 5s # evaluate rules every 5 seconds

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "/opt/prometheus-2.23.0.linux-amd64/rule.yml"

scrape_configs:
  - job_name: 'node'
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - "/opt/prometheus-2.23.0.linux-amd64/conf/node*.yml"
    relabel_configs:
    - source_labels:
      - "__region_id__"
      - "__availability_zone__"
      separator: "-"
      regex: "(.*)"
      target_label: "region_zone"
      action: replace
      replacement: "$1"

The targets are shown below:

keep

Original configuration

global:
  scrape_interval: 5s # scrape every 5 seconds
  evaluation_interval: 5s # evaluate rules every 5 seconds

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "/opt/prometheus-2.23.0.linux-amd64/rule.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - "/opt/prometheus-2.23.0.linux-amd64/conf/node*.yml"

Target information is shown below:

Modify the configuration file

global:
  scrape_interval: 5s # scrape every 5 seconds
  evaluation_interval: 5s # evaluate rules every 5 seconds

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "/opt/prometheus-2.23.0.linux-amd64/rule.yml"

scrape_configs:
  - job_name: 'bounter-monitor'
    scrape_interval: 5s
    metrics_path: '/actuator/prometheus'
    static_configs:
    - targets: ['192.168.10.228:8080']
      labels:
        __hostname__: springboot
    relabel_configs:
    - source_labels:
      - "__hostname__"
      regex: "(dev-database)"
      action: keep

The targets are shown below:

With action: keep, only instances whose source_labels value matches regex (dev-database) are scraped; all other instances are not.

drop

Change action to drop; the targets are shown below:

With action: drop, instances whose source_labels value matches regex (dev-database) are not scraped; all other instances are.

labelkeep
NodeExporter
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz

Extract and install

tar zxvf node_exporter-1.0.1.linux-amd64.tar.gz
mkdir -p /export/prometheus_exporter
mv node_exporter-1.0.1.linux-amd64/ /export/prometheus_exporter/node_exporter
vi /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target

[Service]
ExecStart=/export/prometheus_exporter/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target

Start

systemctl start node_exporter
systemctl enable node_exporter
alertmanager
curl -LO https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar zxvf alertmanager-0.21.0.linux-amd64.tar.gz
mv alertmanager-0.21.0.linux-amd64/ /usr/local/alertmanager

Create the unit file

vi /usr/lib/systemd/system/alertmanager.service

Add the following:

[Unit]
Description=alertmanager
Documentation=https://github.com/prometheus/alertmanager
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alert-test.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target

The Alertmanager installation directory ships with a default alertmanager.yml; you can also create a new configuration file and point to it at startup.

global:
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '2977358239@qq.com'
  smtp_auth_username: '2977358239@qq.com'
  smtp_auth_password: 'jgigqzrlhycddhcf' # the mailbox's authorization code, not its login password
  smtp_require_tls: false
templates:
  - '/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver
receivers:
  - name: 'default-receiver'
    email_configs:
    - to: 'nimingkun@dingtalk.com'
      html: '{{ template "alert.html" . }}'
      headers: { Subject: "[WARN] alert mail test" }

smtp_smarthost: the SMTP server address and port of the sending mailbox;

smtp_auth_password: the sending mailbox's authorization code, not its login password;

smtp_require_tls: defaults to true when unset; with true, a starttls error occurs here, so for simplicity it is set to false;

templates: the path where mail templates live;

html under receivers names the mail body template, here "alert.html", which is defined in one of the files on the template path;

headers: the mail subject.
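Routes can also be nested; a hypothetical child route (the critical-receiver name is invented here) that diverts critical alerts to a different receiver would look like:

```yaml
route:
  receiver: default-receiver        # fallback for everything unmatched
  group_by: ['alertname', 'cluster', 'service']
  routes:
  - match:
      severity: critical            # alerts carrying severity="critical"
    receiver: critical-receiver     # must also be defined under receivers
```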

3. Configure alerting rules

Configure rule.yml

cd /usr/local/prometheus
vim rule.yml

groups:
  - name: alert-rules.yml
    rules:
    - alert: dev-database # alert name
      expr: up{job="dev-database"} == 0 # trigger condition
      for: 10s # the condition must hold for 10s before the alert fires
      labels: # labels attached to the alert
        severity: "critical"
      annotations: # additional alert information, not used to identify the alert
        description: the server has been down for more than 10s
        summary: server running state
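The for: 10s clause means the expression must stay true for the whole duration before the alert actually fires; a simplified sketch of that state machine (not Prometheus code):

```python
# Simplified sketch of the alerting `for:` semantics: an alert is
# "pending" while its expression has been true for less than `for`,
# and "firing" once the breach has lasted at least that long.
def alert_state(breach_started_at, now, for_seconds):
    if breach_started_at is None:      # expression currently false
        return "inactive"
    if now - breach_started_at >= for_seconds:
        return "firing"
    return "pending"
```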

Point prometheus.yml at rule.yml

cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093 # changed to localhost here
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/usr/local/prometheus/rule.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090','localhost:9100']
  - job_name: 'dev-database'
    scrape_interval: 5s
    static_configs:
    - targets: ['192.168.50.57:9100']

Restart the Prometheus service:

systemctl restart prometheus

4. Write the mail template

Note: the file suffix must be tmpl.

mkdir -pv /alertmanager/template/
vim /alertmanager/template/alert.tmpl

{{ define "alert.html" }}
<table>
<tr><td>Alert name</td><td>Start time</td></tr>
<tr><td></td><td></td></tr>
</table>
{{ end }}

5. Start Alertmanager

systemctl daemon-reload
systemctl start alertmanager.service
systemctl status alertmanager.service

6. Verify the result.

The management UI now shows the following:

Then stop the node_exporter service on the dev-database node and watch what happens.

systemctl stop node_exporter.service

The mailbox should then receive an alert email:

Monitoring Linux

Install NodeExporter on the machine, then configure the scrape target in prometheus.yml

vi /usr/local/prometheus/prometheus.yml

- job_name: 'dev-database'
  static_configs:
  - targets: ['192.168.50.57:9100']

In Prometheus, an endpoint that can be scraped is called an instance, and a collection of instances with the same purpose is called a job.

vi /usr/local/prometheus/prometheus.yml

- job_name: 'dev-database'
  static_configs:
  - targets: ['192.168.50.57:9100','192.168.50.58:9100','192.168.50.59:9100']

Use https://grafana.com/grafana/dashboards/11074 for the dashboard

Import the template into Grafana

Upload the JSON file and select Prometheus as the data source

Monitoring MySQL

## download mysqld_exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz
tar zxvf mysqld_exporter-0.12.1.linux-amd64.tar.gz
mv mysqld_exporter-0.12.1.linux-amd64 /usr/local/mysqld_exporter

Grant access

To collect metrics, the exporter must be authorized to connect to MySQL.

GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'exporter'@'localhost' identified by '123456';
GRANT SELECT ON performance_schema.* TO 'exporter'@'localhost';
flush privileges;

Note: this grants local login only, which suits deploying mysqld_exporter on the MySQL server itself; if it is deployed on the Prometheus server instead, grant remote login.

Create the credentials file

cd /usr/local/mysqld_exporter
vim .my.cnf
[client]
user=exporter
password=123456

Start with systemd

vim /usr/lib/systemd/system/mysqld_exporter.service

[Unit]
Description=mysqld_exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/mysqld_exporter/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter/.my.cnf
Restart=on-failure

[Install]
WantedBy=multi-user.target

Reload systemd and start.

systemctl daemon-reload
systemctl start mysqld_exporter
systemctl status mysqld_exporter
systemctl enable mysqld_exporter

Add the scrape target to prometheus.yml

vi /usr/local/prometheus/prometheus.yml

- job_name: 'mysql'
  static_configs:
  - targets: ['192.168.50.57:9104']
    labels:
      instance: db

Restart the service

systemctl restart prometheus

Download the template https://grafana.com/api/dashboards/9623/revisions/4/download and import it into Grafana

Monitoring Spring Boot
  1. Add the following dependencies
<!-- monitoring begin -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Micrometer -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<!-- monitoring end -->
  2. Configure monitoring
spring:
  application:
    name: bounter-monitor

## expose all actuator endpoints
management:
  endpoints:
    web:
      exposure:
        include: "*"
  metrics:
    tags:
      application: ${spring.application.name}

  3. Package and run

mvn clean install
java -jar nmk0718.jar

  4. Configure prometheus.yml

# SpringBoot Application
- job_name: 'bounter-monitor'
  scrape_interval: 5s
  metrics_path: '/actuator/prometheus'
  static_configs:
  - targets: ['localhost:8080']

Restart Prometheus and the monitoring data will show up in Grafana

The dashboards https://grafana.com/grafana/dashboards/4701 or https://grafana.com/grafana/dashboards/10280 can be used as templates

Monitoring RabbitMQ

Download

wget https://github.com/kbudde/rabbitmq_exporter/releases/download/v1.0.0-RC7/rabbitmq_exporter-1.0.0-RC7.linux-amd64.tar.gz

Extract

tar zxvf rabbitmq_exporter-1.0.0-RC7.linux-amd64.tar.gz

Run the exporter

RABBIT_USER=liangjian RABBIT_PASSWORD=liangjian360 OUTPUT_FORMAT=JSON PUBLISH_PORT=9099 RABBIT_URL=http://192.168.50.51:5672 nohup ./rabbitmq_exporter &

Verify: open http://192.168.50.51:9099/metrics in a browser

Configure scraping

vi prometheus.yml

- job_name: 'rabbitmq'
  scrape_interval: 60s
  scrape_timeout: 60s
  static_configs:
  - targets: ['192.168.50.51:9099']

Configure alerting

vi rule.yml

groups:
  - name: alert-rules.yml
    rules:
    - alert: "rabbitmq instance down"
      expr: up{job="rabbitmq"} == 0
      for: 5s
      labels:
        alertname: test_rabbitmq_monitor
        severity: "critical"
      annotations:
        description: "rabbitmq {{ $labels.instance }} is down"
        summary: "test alert: rabbitmq instance down"

Verify

Monitoring Redis

Download: https://github.com/oliver006/redis_exporter/releases

[root@database opt]# tar zxvf redis_exporter-v1.13.1.linux-amd64.tar.gz 
redis_exporter-v1.13.1.linux-amd64/
redis_exporter-v1.13.1.linux-amd64/README.md
redis_exporter-v1.13.1.linux-amd64/redis_exporter
redis_exporter-v1.13.1.linux-amd64/LICENSE
[root@database opt]# cd redis_exporter-v1.13.1.linux-amd64/
[root@database redis_exporter-v1.13.1.linux-amd64]# ls
LICENSE README.md redis_exporter
[root@database redis_exporter-v1.13.1.linux-amd64]# nohup ./redis_exporter -redis.addr 192.168.50.51:6379 -redis.password liangjian360 &
[1] 208232
[root@database redis_exporter-v1.13.1.linux-amd64]# nohup: ignoring input and appending output to ‘nohup.out’
[root@database redis_exporter-v1.13.1.linux-amd64]# netstat -lntp
tcp6 0 0 :::9121 :::* LISTEN 208232/./redis_expo

Configure prometheus.yml

[root@monitor prometheus-2.23.0.linux-amd64]# vi prometheus.yml
- job_name: 'redis'
  static_configs:
  - targets: ['192.168.50.51:9121']

[root@monitor prometheus-2.23.0.linux-amd64]# systemctl restart prometheus

Check the Targets page

Configure Grafana using https://grafana.com/grafana/dashboards/11835

Configure alertmanager

[root@monitor prometheus-2.23.0.linux-amd64]# cat rule.yml
groups:
  - name: alert-rules.yml
    rules:
    - alert: "redis instance down"
      expr: up{job="redis"} == 0
      for: 5s
      labels:
        alertname: redis_monitor
        severity: "critical"
      annotations:
        description: "redis {{ $labels.instance }} is down"
        summary: "test alert: redis instance down"

After stopping the Redis exporter, the alert email is received.