
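Cassandra Monitoring

A Monitoring Solution for the Cassandra Database

This article describes a monitoring solution for the Cassandra database and how to deploy it.
Prometheus collects, stores, and processes the data, and Grafana displays it on monitoring dashboards.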

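Basic Concepts

Introduction to Prometheus

Prometheus is an open-source monitoring framework that uses separate components for data collection, storage, and alerting. The Prometheus server itself provides only storage (as time series data) and processing (through a rich query language that supports selection, statistics, aggregation, and so on). The data comes from a large set of plug-ins, which Prometheus calls exporters; each exporter exposes an HTTP endpoint that Prometheus scrapes at regular intervals. Alerting is handled by Alertmanager.

Prometheus currently obtains data only in PULL mode: the server fetches metrics from its targets, and the targets do not send data to it.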
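
To make the pull model concrete, here is a minimal sketch in Python using only the standard library. It is not how the JMX exporter is implemented; it only illustrates the idea of an exporter that serves plain-text metrics over HTTP and a scraper that fetches them. The metric `demo_requests_total` and the names `MetricsHandler`, `scrape`, and `demo` are made up for this illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A made-up metric in the Prometheus text exposition format.
METRICS = (
    "# HELP demo_requests_total Requests handled (hypothetical metric).\n"
    "# TYPE demo_requests_total counter\n"
    "demo_requests_total 42\n"
)


class MetricsHandler(BaseHTTPRequestHandler):
    """Plays the role of an exporter: answers every GET with the metrics text."""

    def do_GET(self):
        body = METRICS.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def scrape(url):
    """Plays the role of Prometheus: pull the current metrics over HTTP."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def demo():
    # Port 0 asks the OS for any free port, so the sketch does not clash
    # with a real exporter.
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `print(demo())` prints the three lines of `METRICS`. A real Prometheus server does the same fetch for every configured target at each scrape interval.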

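Introduction to Grafana

Grafana is an open-source suite for metrics analytics and visualization. Its front end runs as a JavaScript application in the browser; it queries a data store (such as InfluxDB) and renders custom reports and charts. It is mostly used for monitoring time series data, much like Kibana. Grafana's UI is flexible, and it has a rich set of plug-ins. Grafana supports many different data sources, and each one has its own query editor tailored to the features and capabilities that the data source exposes.

The following data sources are officially supported: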

  • Prometheus
  • Graphite
  • InfluxDB
  • OpenTSDB
  • MySQL
  • Elasticsearch
  • CloudWatch
  • KairosDB

Deploying the Monitoring System

Project path: /work/cassandra

# pwd
/work/cassandra
#
# ls
bin conf data doc interface javadoc lib logs pylib tools
#

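Deploying the monitoring system involves the following steps:

  • Prepare the installation packages
  • Install the JMX exporter
  • Install Prometheus
  • Install Grafana
  • Import the dashboard
  • View the result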

Preparing the Installation Packages

Download Prometheus v2.17.1

Download Grafana v6.7.2

Download jmx_prometheus_javaagent

Enabling the JMX Exporter

The JMX exporter must be enabled on every node of the Cassandra cluster.

Server list:

  • 192.168.1.66
  • 192.168.1.67
  • 192.168.1.78

Perform the following steps on each node:

  • Copy the jmx_prometheus_javaagent-0.12.0.jar file to ${CASSANDRA_HOME}/lib

    # cp jmx_prometheus_javaagent-0.12.0.jar ${CASSANDRA_HOME}/lib
    # chown cassandra:cassandra ${CASSANDRA_HOME}/lib/jmx_prometheus_javaagent-0.12.0.jar
  • Add a jmx.yaml configuration file under ${CASSANDRA_HOME}/conf

    lowercaseOutputLabelNames: true
    lowercaseOutputName: true
    whitelistObjectNames: ["org.apache.cassandra.metrics:*"]
    # ColumnFamily is an alias for Table metrics
    blacklistObjectNames: ["org.apache.cassandra.metrics:type=ColumnFamily,*"]
    rules:
    # Generic gauges with 0-2 labels
    - pattern: org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?(?:, scope=(\S*))?, name=(\S*)><>Value
      name: cassandra_$1_$5
      type: GAUGE
      labels:
        "$1": "$4"
        "$2": "$3"

    #
    # Emulate Prometheus 'Summary' metrics for the exported 'Histogram's.
    # TotalLatency is the sum of all latencies since server start
    #
    - pattern: org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?(?:, scope=(\S*))?, name=(.+)?(?:Total)(Latency)><>Count
      name: cassandra_$1_$5$6_seconds_sum
      type: UNTYPED
      labels:
        "$1": "$4"
        "$2": "$3"
      # Convert microseconds to seconds
      valueFactor: 0.000001

    - pattern: org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?(?:, scope=(\S*))?, name=((?:.+)?(?:Latency))><>Count
      name: cassandra_$1_$5_seconds_count
      type: UNTYPED
      labels:
        "$1": "$4"
        "$2": "$3"

    - pattern: org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?(?:, scope=(\S*))?, name=(.+)><>Count
      name: cassandra_$1_$5_count
      type: UNTYPED
      labels:
        "$1": "$4"
        "$2": "$3"

    - pattern: org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?(?:, scope=(\S*))?, name=((?:.+)?(?:Latency))><>(\d+)thPercentile
      name: cassandra_$1_$5_seconds
      type: GAUGE
      labels:
        "$1": "$4"
        "$2": "$3"
        quantile: "0.$6"
      # Convert microseconds to seconds
      valueFactor: 0.000001

    - pattern: org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?(?:, scope=(\S*))?, name=(.+)><>(\d+)thPercentile
      name: cassandra_$1_$5
      type: GAUGE
      labels:
        "$1": "$4"
        "$2": "$3"
        quantile: "0.$6"
  • Add the agent startup option to ${CASSANDRA_HOME}/conf/cassandra-env.sh

    JVM_OPTS="$JVM_OPTS -javaagent:${CASSANDRA_HOME}/lib/jmx_prometheus_javaagent-0.12.0.jar=7070:${CASSANDRA_HOME}/conf/jmx.yaml"

    [Figure: cassandra-env.sh with the added option]

  • Restart the Cassandra service

    # su cassandra
    $ cd ${CASSANDRA_HOME}
    $ ./stop.sh
    $ ./start.sh
  • Verify that the exporter is serving metrics

    # curl 192.168.1.78:7070 > collection.txt
    % Total % Received % Xferd Average Speed Time Time Time Current
    Dload Upload Total Spent Left Speed
    100 826k 100 826k 0 0 632k 0 0:00:01 0:00:01 --:--:-- 632k
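
To get a feel for which metric names to expect in collection.txt, the sketch below applies the first rule from jmx.yaml (the generic gauge rule) to one bean string in Python. It only approximates what the exporter does: the capture groups are substituted into `name` and `labels`, then lowercased as requested by `lowercaseOutputName` and `lowercaseOutputLabelNames`. The bean string `SAMPLE` and the helper `apply_rule` are made up for illustration, and the "domain<key=value, ...><>attribute" input shape is assumed from the patterns themselves.

```python
import re

# First rule from jmx.yaml, copied verbatim; this syntax behaves the same
# in Java and Python regular expressions.
GENERIC_GAUGE = re.compile(
    r"org.apache.cassandra.metrics<type=(\S*)(?:, ((?!scope)\S*)=(\S*))?"
    r"(?:, scope=(\S*))?, name=(\S*)><>Value"
)


def apply_rule(bean):
    """Approximate the rule's rewrite; returns (metric_name, labels) or None."""
    match = GENERIC_GAUGE.search(bean)
    if match is None:
        return None
    mtype, key, value, scope, name = match.groups()
    # name: cassandra_$1_$5, lowercased by lowercaseOutputName
    metric = f"cassandra_{mtype}_{name}".lower()
    # labels: "$1": "$4" and "$2": "$3", label names lowercased
    labels = {}
    if scope:
        labels[mtype.lower()] = scope
    if key and value:
        labels[key.lower()] = value
    return metric, labels


# Hypothetical bean string, not taken from a real node.
SAMPLE = (
    "org.apache.cassandra.metrics<type=Table, keyspace=ks1, "
    "scope=users, name=LiveSSTableCount><>Value"
)
```

With this input, `apply_rule(SAMPLE)` returns `("cassandra_table_livesstablecount", {"table": "users", "keyspace": "ks1"})`, so a line starting with `cassandra_table_livesstablecount{` is the kind of entry to look for in the scraped output. A bean outside `org.apache.cassandra.metrics` returns `None`.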

Installing Prometheus

Project path: /work/prometheus
Service address: 192.168.1.15:9090

# pwd
/work/prometheus
#
# ls
bin conf log
#
  • Create the conf/prometheus.yml file

    ---
    global:
      scrape_interval: 15s     # By default, scrape targets every 15 seconds.
      evaluation_interval: 15s # Evaluate rules every 15 seconds.
      # scrape_timeout is set to the global default (10s).
      external_labels:
        cluster: 'test-cluster'
        monitor: "prometheus"

    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - '192.168.1.234:9093'

    scrape_configs:
    - job_name: 'cassandra'
      scrape_interval: 15s
      #honor_labels: true # don't overwrite job & instance labels
      static_configs:
      - targets:
        - '192.168.1.66:7070'
        - '192.168.1.67:7070'
        - '192.168.1.78:7070'
  • Create the startup script /work/prometheus/run_prometheus.sh

    #!/bin/bash
    set -e
    ulimit -n 1000000

    DEPLOY_DIR=/work/prometheus
    cd "${DEPLOY_DIR}" || exit 1

    # WARNING: This file was auto-generated. Do not edit!
    # All your edit might be overwritten!
    exec > >(tee -i -a "/work/prometheus/log/prometheus.log")
    exec 2>&1

    exec bin/prometheus \
    --config.file="/work/prometheus/conf/prometheus.yml" \
    --web.listen-address=":9090" \
    --web.external-url="http://192.168.1.15:9090/" \
    --web.enable-admin-api \
    --log.level="info" \
    --storage.tsdb.path="/work/prometheus
  • Start the Prometheus service (the script uses bash-only syntax, so run it with bash rather than sh)

    $ bash run_prometheus.sh
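
Once Prometheus is running, its HTTP API can confirm that the three Cassandra targets are being scraped. The sketch below builds an instant query for `up{job="cassandra"}` against `/api/v1/query` and lists the instances whose `up` value is not 1. The helper names `query_url`, `down_instances`, and `check` are made up for this illustration, and `PROMETHEUS` assumes the service address used in this article.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://192.168.1.15:9090"  # service address from this article


def query_url(base, promql):
    """Build the URL of an instant query against the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def down_instances(payload):
    """Return the instance labels whose `up` sample is not 1."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) != 1.0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return sorted(down)


def check(base=PROMETHEUS):
    """Query a live Prometheus server; needs network access to `base`."""
    url = query_url(base, 'up{job="cassandra"}')
    with urllib.request.urlopen(url, timeout=5) as resp:
        return down_instances(json.load(resp))
```

An empty list from `check()` means every target in the `cassandra` job answered its last scrape. A target that does not appear in the result at all is not covered by this check, so it is worth also looking at the Targets page in the Prometheus web UI.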

Installing Grafana

Project path: /work/grafana
Service address: 192.168.1.15:3000

# pwd
/work/grafana
#
# ls
bin conf log opt
# Note: opt/grafana is the Grafana installation directory
  • Edit the /work/grafana/conf/grafana.ini configuration file

    [paths]
    data = /work/grafana/data.grafana
    logs = /work/grafana/log
    plugins = /work/grafana/opt/grafana/plugins
    [server]
    http_port = 3000
    domain = 192.168.1.15
    [analytics]
    check_for_updates = true
    [security]
    admin_user = admin
    admin_password = admin
    [log.file]
    level = info
    format = text
    [dashboards.json]
    enabled = false
    path = /work/grafana/opt/grafana/dashboards
  • Create the startup script /work/prometheus/run_grafana.sh

    #!/bin/bash
    set -e
    ulimit -n 1000000

    # WARNING: This file was auto-generated. Do not edit!
    # All your edit might be overwritten!
    DEPLOY_DIR=/work/grafana
    cd "${DEPLOY_DIR}" || exit 1

    LANG=en_US.UTF-8 \
    exec opt/grafana/bin/grafana-server \
    --homepath="/work/grafana/opt/grafana" \
    --config="/work/grafana/opt/grafana/conf/grafana.ini"
  • Start Grafana

    # bash run_grafana.sh
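
After the exporters, Prometheus, and Grafana have all been started, a quick TCP reachability check can catch a service that failed to come up or a firewall that blocks a port. The following is a minimal sketch, not part of the original deployment; `port_open`, `report`, and `ENDPOINTS` are made-up names, and the addresses are the ones used in this article.

```python
import socket

# Endpoints from this article: three JMX exporters, Prometheus, and Grafana.
ENDPOINTS = [
    ("192.168.1.66", 7070),
    ("192.168.1.67", 7070),
    ("192.168.1.78", 7070),
    ("192.168.1.15", 9090),
    ("192.168.1.15", 3000),
]


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def report(endpoints=ENDPOINTS):
    """Map each "host:port" string to its reachability."""
    return {f"{host}:{port}": port_open(host, port) for host, port in endpoints}
```

Calling `report()` from a machine on the same network returns a dictionary such as `{"192.168.1.66:7070": True, ...}`. An open port only shows that something is listening; it does not prove that the service behind it is healthy.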

Importing the Dashboard

Download the Cassandra dashboard

Result